# Notebook 04 - Feature Engineering

## Objectives

Engineer Features for:
* Classification
* Regression
* Clustering

## Inputs
* outputs/datasets/cleaned/train.parquet.gzip

## Outputs
* Create clean dataset:
    * All new datasets from cleaning will be stored in inputs/datasets/cleaning
* Split the created dataset into 3 parts:
    * Train
    * Validate
    * Test
* All new datasets (train, validate and test) will be stored in outputs/datasets/cleaned

## Change working directory
In this section we get the location of the current directory and move one step up to the parent folder, so the app can access the project folder.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [1]:
import os

current_dir = os.getcwd()
current_dir

'/Users/pecukevicius/DataspellProjects/heritage_houses_p5/jupyter_notebooks'

In [2]:
os.chdir(os.path.dirname(current_dir))
print('Current working directory is', os.getcwd())

Current working directory is /Users/pecukevicius/DataspellProjects/heritage_houses_p5


## Loading Dataset

In [3]:
import pandas as pd

df = pd.read_parquet('outputs/datasets/cleaned/train.parquet.gzip')
df.drop(columns=['Unnamed: 0'], inplace=True)
df.head()

Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,...,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,SalePrice
618,1828,0,2,Av,48,Unf,1774,0,774,Unf,...,90,452,108,5,9,1822,0,2007,2007,314813
870,894,0,2,No,0,Unf,894,0,308,Unf,...,60,0,0,5,5,894,0,1962,1962,109500
92,964,0,2,No,713,ALQ,163,0,432,Unf,...,80,0,0,7,5,876,0,1921,2006,163500
817,1689,0,3,No,1218,GLQ,350,0,857,RFn,...,70,148,59,5,8,1568,0,2002,2002,271000
302,1541,0,3,No,0,Unf,1541,0,843,RFn,...,118,150,81,5,7,1541,0,2001,2002,205000


## Data Exploration

Hypothesis 2 also failed. It is possible that features interact with each other and can be combined into new ones; at the same time, we can extract useful information from existing features.
1. Change the encoding (create a dictionary for the ordinal encoder):
    * When we encode Basement Exposure and Basement Finish Type, None becomes 0, which is fine because there is no basement.
    * The same applies to Garage Finish: None becomes 0 because there is no garage.
    * Kitchen Quality: Po (Poor) becomes 0, which is wrong, because a poor kitchen is still a kitchen. The encoded number matters because it interacts with other features.
2. Create new mathematical sub_features:
    * Basement:
        * Basement Exposure mathematical manipulations with all Basement Areas
        * Basement Finish Type manipulations with all Basement Areas
    * Garage:
        * Garage Finish mathematical manipulations with Garage Area
    * Building:
        * Overall Cond mathematical manipulations with building areas
        * Overall Quality mathematical manipulations with building areas
3. Extract information and create new sub_features (we know the building dates go up to 2010):
    * Garage Age = 2010 - Garage Year Built
    * Building Age = 2010 - Year Built
    * Remod Age = 2010 - Remodel Year
    * Remod Age Test = 0 if the house was built and remodeled in the same year, else Remod Age
4. Check whether a house feature exists (maybe the size of a garage, porch or deck does not matter; what matters is that it is there):
    * Has 2nd Floor - True if the area of the 2nd floor > 0, else False
    * Has Basement - True if the building has a basement, else False
    * Has Garage - True if the building has a garage, else False
    * Has Masonry Veneer - True if the building has masonry veneer, else False
    * Has Enclosed Porch - True if the building has an enclosed porch, else False
    * Has Open Porch - True if the building has an open porch, else False
    * Has Any Porch - True if the building has any type of porch, else False
    * Has Wooden Deck - True if the building has a wooden deck, else False
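
Points 3 and 4 can be sketched on a small toy DataFrame. This is only an illustration under assumptions: the rows are made up, `GarageYrBlt` is an assumed name for the garage year column, and the new column names (`GarageAge`, `Has2ndFloor`, and so on) are hypothetical.

```python
import pandas as pd

REFERENCE_YEAR = 2010  # building dates go up to 2010, as stated above

# Toy rows for illustration only (not taken from the dataset)
houses = pd.DataFrame({
    'YearBuilt': [2007, 1921],
    'YearRemodAdd': [2007, 2006],
    'GarageYrBlt': [2007, 1950],  # assumed column name
    '2ndFlrSF': [0, 850],
    'TotalBsmtSF': [1822, 0],
    'GarageArea': [774, 0],
    'MasVnrArea': [452, 0],
    'EnclosedPorch': [0, 0],
    'OpenPorchSF': [108, 0],
    'WoodDeckSF': [0, 120],
})

# Point 3: ages relative to the reference year
houses['GarageAge'] = REFERENCE_YEAR - houses['GarageYrBlt']
houses['BuildingAge'] = REFERENCE_YEAR - houses['YearBuilt']
houses['RemodAge'] = REFERENCE_YEAR - houses['YearRemodAdd']
# 0 when built and remodeled in the same year, otherwise the remodel age
houses['RemodAgeTest'] = houses['RemodAge'].where(
    houses['YearBuilt'] != houses['YearRemodAdd'], 0)

# Point 4: presence flags, True when the matching area is above zero
area_flags = {
    'Has2ndFloor': '2ndFlrSF',
    'HasBasement': 'TotalBsmtSF',
    'HasGarage': 'GarageArea',
    'HasMasonryVeneer': 'MasVnrArea',
    'HasEnclosedPorch': 'EnclosedPorch',
    'HasOpenPorch': 'OpenPorchSF',
    'HasWoodenDeck': 'WoodDeckSF',
}
for flag, area_column in area_flags.items():
    houses[flag] = houses[area_column] > 0
houses['HasAnyPorch'] = houses['HasEnclosedPorch'] | houses['HasOpenPorch']

print(houses[['BuildingAge', 'RemodAgeTest', 'HasGarage', 'HasAnyPorch']])
```

Deriving the flags from the area columns keeps them consistent with the sizes already in the data. Houses without a garage may have no garage year at all, so the garage age would need separate handling in that case.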

After the new features are created, check for correlations between the existing features and the new ones.
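
A minimal sketch of such a correlation check, assuming Pearson correlation and toy values that are not taken from the dataset. The combined column name `TotalBsmtSF_x_BsmtExposure` is hypothetical.

```python
import pandas as pd

# Toy data for illustration only; values are not from the dataset
toy = pd.DataFrame({
    'TotalBsmtSF': [800, 1200, 1500, 900, 1800],
    'BsmtExposure': [1, 2, 3, 1, 4],
    'SalePrice': [110000, 160000, 210000, 120000, 300000],
})
toy['TotalBsmtSF_x_BsmtExposure'] = toy['TotalBsmtSF'] * toy['BsmtExposure']

features = toy.drop(columns='SalePrice')

# Correlation of every feature with the target, strongest first
target_corr = features.corrwith(toy['SalePrice']).sort_values(key=abs, ascending=False)
print(target_corr)

# Pairwise correlation between features, to spot redundant ones
print(features.corr())
```

The first table shows whether a new feature relates to the target more strongly than its parts; the second shows which features largely duplicate each other.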

## Feature Engineering

### Categorical Features Encoding

1. We will define the category order for the encoder, so that encoded categorical features receive correct, or at least logical, numbers
2. We will add a second encoder, OneHotEncoder, so we can compare how each encoding increases or decreases model performance

In [4]:
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

# Getting all categorical features as a list
categorical_features = df.select_dtypes(include=['object', 'category']).columns.tolist()

# Encoding order for each categorical feature, from lowest to highest.
# 'None' is listed first for Kitchen Quality too, otherwise Po would be encoded as 0
order = {
    'BsmtExposure': ['None', 'No', 'Mn', 'Av', 'Gd'],
    'BsmtFinType1': ['None', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'],
    'GarageFinish': ['None', 'Unf', 'RFn', 'Fin'],
    'KitchenQual': ['None', 'Po', 'Fa', 'TA', 'Gd', 'Ex']
}

# Initialize the OrdinalEncoder, matching each category list to its column by name
# so the result does not depend on the column order in df
encoder = OrdinalEncoder(categories=[order[col] for col in categorical_features])

# Fit and transform the data
df[categorical_features] = encoder.fit_transform(df[categorical_features])

### Basement Features

First, we will create new sub_features using RelativeFeatures:

In [5]:
from feature_engine.creation import RelativeFeatures

basement_features = ['BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtUnfSF', 'TotalBsmtSF']
transformer = RelativeFeatures(
    variables=['BsmtFinSF1', 'BsmtUnfSF', 'TotalBsmtSF'],
    reference=['BsmtExposure', 'BsmtFinType1'],
    func=["sub", "mul", "add"],  # Try subtraction, multiplication and addition
)
df_basement = transformer.fit_transform(df[basement_features])
df_basement.head()

Unnamed: 0,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtUnfSF,TotalBsmtSF,BsmtFinSF1_sub_BsmtExposure,BsmtUnfSF_sub_BsmtExposure,TotalBsmtSF_sub_BsmtExposure,BsmtFinSF1_sub_BsmtFinType1,BsmtUnfSF_sub_BsmtFinType1,...,TotalBsmtSF_mul_BsmtExposure,BsmtFinSF1_mul_BsmtFinType1,BsmtUnfSF_mul_BsmtFinType1,TotalBsmtSF_mul_BsmtFinType1,BsmtFinSF1_add_BsmtExposure,BsmtUnfSF_add_BsmtExposure,TotalBsmtSF_add_BsmtExposure,BsmtFinSF1_add_BsmtFinType1,BsmtUnfSF_add_BsmtFinType1,TotalBsmtSF_add_BsmtFinType1
618,3.0,1.0,48,1774,1822,45.0,1771.0,1819.0,47.0,1773.0,...,5466.0,48.0,1774.0,1822.0,51.0,1777.0,1825.0,49.0,1775.0,1823.0
870,1.0,1.0,0,894,894,-1.0,893.0,893.0,-1.0,893.0,...,894.0,0.0,894.0,894.0,1.0,895.0,895.0,1.0,895.0,895.0
92,1.0,5.0,713,163,876,712.0,162.0,875.0,708.0,158.0,...,876.0,3565.0,815.0,4380.0,714.0,164.0,877.0,718.0,168.0,881.0
817,1.0,6.0,1218,350,1568,1217.0,349.0,1567.0,1212.0,344.0,...,1568.0,7308.0,2100.0,9408.0,1219.0,351.0,1569.0,1224.0,356.0,1574.0
302,1.0,1.0,0,1541,1541,-1.0,1540.0,1540.0,-1.0,1540.0,...,1541.0,0.0,1541.0,1541.0,1.0,1542.0,1542.0,1.0,1542.0,1542.0
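
The new column names in the output follow the pattern `variable_function_reference`. As a plain-pandas check of what those columns contain (a re-computation for illustration, not the library's internal code), the values of row 618 can be reproduced by hand:

```python
import operator

import pandas as pd

# Values of row 618, copied from the output above
row = pd.DataFrame({
    'BsmtExposure': [3.0],
    'BsmtFinType1': [1.0],
    'BsmtFinSF1': [48],
    'TotalBsmtSF': [1822],
}, index=[618])

operations = {'sub': operator.sub, 'mul': operator.mul, 'add': operator.add}

manual = pd.DataFrame(index=row.index)
for func_name, func in operations.items():
    for reference in ['BsmtExposure', 'BsmtFinType1']:
        for variable in ['BsmtFinSF1', 'TotalBsmtSF']:
            # Each new column is: variable <operation> reference
            manual[f'{variable}_{func_name}_{reference}'] = func(row[variable], row[reference])

print(manual.T)
```

For example, `BsmtFinSF1_sub_BsmtExposure` is 48 - 3 = 45 and `TotalBsmtSF_mul_BsmtExposure` is 1822 * 3 = 5466, matching the table.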


Now, using SmartCorrelatedSelection, we will identify sets of correlated features, so we do not need to work with all sub_features

In [6]:
from feature_engine.selection import SmartCorrelatedSelection

tr = SmartCorrelatedSelection(
    variables=None,
    method="pearson",
    threshold=0.8,
    missing_values="raise",
    selection_method="variance",
    estimator=None,
)

tr.fit_transform(df_basement)

basement_feature_sets = tr.correlated_feature_sets_
basement_feature_sets

[{'BsmtExposure', 'TotalBsmtSF_mul_BsmtExposure'},
 {'BsmtFinType1', 'TotalBsmtSF_mul_BsmtFinType1'},
 {'BsmtFinSF1',
  'BsmtFinSF1_add_BsmtExposure',
  'BsmtFinSF1_add_BsmtFinType1',
  'BsmtFinSF1_mul_BsmtFinType1',
  'BsmtFinSF1_sub_BsmtExposure',
  'BsmtFinSF1_sub_BsmtFinType1'},
 {'BsmtUnfSF',
  'BsmtUnfSF_add_BsmtExposure',
  'BsmtUnfSF_add_BsmtFinType1',
  'BsmtUnfSF_sub_BsmtExposure',
  'BsmtUnfSF_sub_BsmtFinType1'},
 {'TotalBsmtSF',
  'TotalBsmtSF_add_BsmtExposure',
  'TotalBsmtSF_add_BsmtFinType1',
  'TotalBsmtSF_sub_BsmtExposure',
  'TotalBsmtSF_sub_BsmtFinType1'}]
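
In the sets above, every `add` and `sub` variant lands in the same set as its base area column. A likely reason is scale: adding or subtracting a small ordinal code barely changes an area measured in hundreds of square feet. The sketch below illustrates this with synthetic, independent random data (an assumption; the real columns are not independent), so the numbers are indicative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Synthetic stand-ins: a large area in square feet and a small ordinal code
area = pd.Series(rng.integers(0, 2000, size=500), dtype=float)
code = pd.Series(rng.integers(0, 7, size=500), dtype=float)

add_corr = area.corr(area + code)
sub_corr = area.corr(area - code)
mul_corr = area.corr(area * code)

print('add:', add_corr)
print('sub:', sub_corr)
print('mul:', mul_corr)
```

The `add` and `sub` correlations stay very close to 1, so those variants carry almost no information beyond the original area. Multiplication changes the values much more, which is why the `mul` features are the more interesting candidates.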

Very nice, we can see the sets. Based on them, we will select just what we need

In [7]:
selected_features = []

for feature_set in tr.correlated_feature_sets_:
    # Calculate variances within each set
    variances = {feature: df_basement[feature].var() for feature in feature_set}
    # Select the feature with the highest variance
    best_feature = max(variances, key=variances.get)
    selected_features.append(best_feature)

print("Selected features:", selected_features)


Selected features: ['TotalBsmtSF_mul_BsmtExposure', 'TotalBsmtSF_mul_BsmtFinType1', 'BsmtFinSF1_mul_BsmtFinType1', 'BsmtUnfSF_sub_BsmtFinType1', 'TotalBsmtSF_add_BsmtFinType1']


We can see that the best features and their combinations are:
1. TotalBsmtSF * BsmtExposure => Looks good and logical
2. TotalBsmtSF * BsmtFinType1 => Also logical
3. BsmtFinSF1 * BsmtFinType1 => Very logical
4. BsmtUnfSF - BsmtFinType1 => Doubtful: it is the unfinished area minus the finish type
5. TotalBsmtSF + BsmtFinType1 => Also not very logical

We will create the new sub_features like this (every new sub_feature gets an xxx prefix, which makes them easy to identify):
```python
df['xxx_BsmtFinType1_BsmtFinSF1'] = df['BsmtFinType1'] * df['BsmtFinSF1']
df['xxx_BsmtExposure_TotalBsmtSF'] = df['BsmtExposure'] * df['TotalBsmtSF']
df['xxx_GarageArea_GarageFinish'] = df['GarageArea'] * df['GarageFinish']
```