# **Feature Engineering**

### Objectives

* Engineer features for Classification, Regression and Cluster models

### Inputs

* outputs/datasets/cleaned/test_set.csv
* outputs/datasets/cleaned/train_set.csv

### Outputs

* Generate a list of variables to engineer

### Additional Comments

* This file and its contents were inspired by and adapted from the Churnometer Walkthrough Project 2.  

### Change working directory

We need to change the working directory from its current folder to its parent folder

* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/housing-prices/jupyter_notebooks'

We want to make the parent of the current directory the new current directory.

* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/housing-prices'
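
As a small aside, the parent-directory step can be checked without changing the working directory at all. The sketch below reuses the path printed above purely as an example string; `PurePosixPath` is used only so the result does not depend on the operating system.

```python
import os
from pathlib import PurePosixPath

# Example path copied from the notebook output above (used here as a plain string)
notebook_dir = "/workspace/housing-prices/jupyter_notebooks"

# os.path.dirname strips the last path component
parent_via_os = os.path.dirname(notebook_dir)

# pathlib expresses the same idea through the .parent attribute
parent_via_pathlib = str(PurePosixPath(notebook_dir).parent)

print(parent_via_os)
print(parent_via_pathlib)
```

Both calls only compute a string; the directory actually changes only when `os.chdir()` is called.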

---

### Load Cleaned Data

Train Set

In [4]:
import pandas as pd
train_set_path = "outputs/datasets/cleaned/train_set.csv"
TrainSet = pd.read_csv(train_set_path)
TrainSet.head(3)

Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,GarageArea,GarageFinish,GarageYrBlt,...,LotArea,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,YearBuilt,YearRemodAdd,SalePrice
0,1828,0,0,Av,48,Unk,1774,774,Unf,2007,...,11694,90,452,108,5,9,1822,2007,2007,314813
1,894,0,2,No,0,Unf,894,308,Unf,1962,...,6600,60,0,0,5,5,894,1962,1962,109500
2,964,0,2,No,713,ALQ,163,432,Unf,1921,...,13360,80,0,0,7,5,876,1921,2006,163500


Test Set

In [5]:
test_set_path = "outputs/datasets/cleaned/test_set.csv"
TestSet = pd.read_csv(test_set_path)
TestSet.head(3)

Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,GarageArea,GarageFinish,GarageYrBlt,...,LotArea,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,YearBuilt,YearRemodAdd,SalePrice
0,2515,0,4,No,1219,Rec,816,484,,1975,...,32668,69,0,0,3,6,2035,1957,1975,200624
1,958,620,3,No,403,BLQ,238,240,Unf,1941,...,9490,79,0,0,7,6,806,1941,1950,133000
2,979,224,3,No,185,LwQ,524,352,Unf,1950,...,7015,69,161,0,4,5,709,1950,1950,110000
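
Both previews start with an `Unnamed: 0` column. This typically appears when a CSV was written with its index and read back without `index_col`. Assuming that is what happened here (the notebook does not confirm it), the following sketch reproduces the effect on a tiny in-memory DataFrame and shows one way to avoid it. The values are illustrative only.

```python
import io
import pandas as pd

# Tiny illustrative frame; not the real dataset
df_original = pd.DataFrame({"LotArea": [11694, 6600], "SalePrice": [314813, 109500]})

# Writing with the default settings stores the index as an unnamed first column
buffer = io.StringIO()
df_original.to_csv(buffer)
csv_text = buffer.getvalue()

# Default read: the stored index comes back as a regular column
df_default = pd.read_csv(io.StringIO(csv_text))

# Reading with index_col=0 treats that first column as the index again
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(df_default.columns))
print(list(df_indexed.columns))
```

If the extra column is kept, it is treated as a numerical feature in every later step, so it is worth deciding early whether it belongs in the data.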


### Data Exploration

We use the `ProfileReport` from `ydata_profiling` to perform an initial exploratory data analysis on the training dataset. This report provides insights into the dataset, such as missing values, distribution of variables, and possible correlations, helping us determine appropriate feature engineering transformations.


In [None]:
from ydata_profiling import ProfileReport
pandas_report = ProfileReport(df=TrainSet, minimal=True)
pandas_report.to_notebook_iframe()
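
For a quick numeric check alongside the report, missing values and skewness can also be read directly from pandas. This is a minimal sketch on made-up values, not output from the real training set.

```python
import pandas as pd

# Illustrative values only; one missing category is included on purpose
df = pd.DataFrame({
    "LotArea": [6600, 7015, 9490, 11694, 32668],
    "GarageFinish": ["Unf", "Unf", None, "Unf", "Unf"],
})

# Count of missing values per column
missing_per_column = df.isna().sum()

# Skewness of numerical columns; values well above 0 suggest a long right tail
skewness = df.select_dtypes(include="number").skew()

print(missing_per_column)
print(skewness)
```

Columns with strong skew are the natural candidates for the numerical transformations explored later.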

### Feature Engineering

* In this section, we will analyze and transform the features in our dataset. We will utilize functions introduced in the feature-engine lesson.


In [None]:
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import warnings
from feature_engine import transformation as vt
from feature_engine.outliers import Winsorizer
from feature_engine.encoding import OrdinalEncoder
sns.set(style="whitegrid")
warnings.filterwarnings("ignore")


def FeatureEngineeringAnalysis(df, analysis_type=None):
    """
    - used for quick feature engineering on numerical and categorical variables
    to decide which transformation can better transform the distribution shape
    - Once transformed, use a reporting tool, like ydata-profiling, to evaluate distributions
    """
    check_missing_values(df)
    allowed_types = ["numerical", "ordinal_encoder", "outlier_winsorizer"]
    check_user_entry_on_analysis_type(analysis_type, allowed_types)
    list_column_transformers = define_list_column_transformers(analysis_type)

    # Loop in each variable and engineer the data according to the analysis type
    df_feat_eng = pd.DataFrame([])
    for column in df.columns:
        # create additional columns (column_method) to apply the methods
        df_feat_eng = pd.concat([df_feat_eng, df[column]], axis=1)
        for method in list_column_transformers:
            df_feat_eng[f"{column}_{method}"] = df[column]

        # Apply transformers in respective column_transformers
        df_feat_eng, list_applied_transformers = apply_transformers(
            analysis_type, df_feat_eng, column)

        # For each variable, assess how the transformations perform
        transformer_evaluation(
            column, list_applied_transformers, analysis_type, df_feat_eng)

    return df_feat_eng


def check_user_entry_on_analysis_type(analysis_type, allowed_types):
    """ Check analysis type """
    if analysis_type is None:
        raise SystemExit(
            f"You should pass analysis_type parameter as one of the following options: {allowed_types}")
    if analysis_type not in allowed_types:
        raise SystemExit(
            f"analysis_type argument should be one of these options: {allowed_types}")


def check_missing_values(df):
    if df.isna().sum().sum() != 0:
        raise SystemExit(
            "There are missing values in your dataset. Please handle them before getting into feature engineering.")


def define_list_column_transformers(analysis_type):
    """ Set suffix columns according to analysis_type"""
    if analysis_type == "numerical":
        list_column_transformers = [
            "log_e", "log_10", "reciprocal", "power", "box_cox", "yeo_johnson"]

    elif analysis_type == "ordinal_encoder":
        list_column_transformers = ["ordinal_encoder"]

    elif analysis_type == "outlier_winsorizer":
        list_column_transformers = ["iqr"]

    return list_column_transformers


def apply_transformers(analysis_type, df_feat_eng, column):
    for col in df_feat_eng.select_dtypes(include="category").columns:
        df_feat_eng[col] = df_feat_eng[col].astype("object")

    if analysis_type == "numerical":
        df_feat_eng, list_applied_transformers = FeatEngineering_Numerical(
            df_feat_eng, column)

    elif analysis_type == "outlier_winsorizer":
        df_feat_eng, list_applied_transformers = FeatEngineering_OutlierWinsorizer(
            df_feat_eng, column)

    elif analysis_type == "ordinal_encoder":
        df_feat_eng, list_applied_transformers = FeatEngineering_CategoricalEncoder(
            df_feat_eng, column)

    return df_feat_eng, list_applied_transformers


def transformer_evaluation(column, list_applied_transformers, analysis_type, df_feat_eng):
    # For each variable, assess how the transformations perform
    print(f"* Variable Analyzed: {column}")
    print(f"* Applied transformation: {list_applied_transformers} \n")
    for col in [column] + list_applied_transformers:

        if analysis_type != "ordinal_encoder":
            DiagnosticPlots_Numerical(df_feat_eng, col)

        else:
            if col == column:
                DiagnosticPlots_Categories(df_feat_eng, col)
            else:
                DiagnosticPlots_Numerical(df_feat_eng, col)

        print("\n")


def DiagnosticPlots_Categories(df_feat_eng, col):
    plt.figure(figsize=(4, 3))
    sns.countplot(data=df_feat_eng, x=col, palette=[
                  "#432371"], order=df_feat_eng[col].value_counts().index)
    plt.xticks(rotation=90)
    plt.suptitle(f"{col}", fontsize=30, y=1.05)
    plt.show()
    print("\n")


def DiagnosticPlots_Numerical(df, variable):
    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    sns.histplot(data=df, x=variable, kde=True, element="step", ax=axes[0])
    stats.probplot(df[variable], dist="norm", plot=axes[1])
    sns.boxplot(x=df[variable], ax=axes[2])

    axes[0].set_title("Histogram")
    axes[1].set_title("QQ Plot")
    axes[2].set_title("Boxplot")
    fig.suptitle(f"{variable}", fontsize=30, y=1.05)
    plt.tight_layout()
    plt.show()


def FeatEngineering_CategoricalEncoder(df_feat_eng, column):
    list_methods_worked = []
    try:
        encoder = OrdinalEncoder(encoding_method="arbitrary", variables=[
                                 f"{column}_ordinal_encoder"])
        df_feat_eng = encoder.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_ordinal_encoder")

    except Exception:
        df_feat_eng.drop([f"{column}_ordinal_encoder"], axis=1, inplace=True)

    return df_feat_eng, list_methods_worked


def FeatEngineering_OutlierWinsorizer(df_feat_eng, column):
    list_methods_worked = []

    # Winsorizer iqr
    try:
        disc = Winsorizer(
            capping_method="iqr", tail="both", fold=1.5, variables=[f"{column}_iqr"])
        df_feat_eng = disc.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_iqr")
    except Exception:
        df_feat_eng.drop([f"{column}_iqr"], axis=1, inplace=True)

    return df_feat_eng, list_methods_worked


def FeatEngineering_Numerical(df_feat_eng, column):
    list_methods_worked = []

    # LogTransformer base e
    try:
        lt = vt.LogTransformer(variables=[f"{column}_log_e"])
        df_feat_eng = lt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_log_e")
    except Exception:
        df_feat_eng.drop([f"{column}_log_e"], axis=1, inplace=True)

    # LogTransformer base 10
    try:
        lt = vt.LogTransformer(variables=[f"{column}_log_10"], base="10")
        df_feat_eng = lt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_log_10")
    except Exception:
        df_feat_eng.drop([f"{column}_log_10"], axis=1, inplace=True)

    # ReciprocalTransformer
    try:
        rt = vt.ReciprocalTransformer(variables=[f"{column}_reciprocal"])
        df_feat_eng = rt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_reciprocal")
    except Exception:
        df_feat_eng.drop([f"{column}_reciprocal"], axis=1, inplace=True)

    # PowerTransformer
    try:
        pt = vt.PowerTransformer(variables=[f"{column}_power"])
        df_feat_eng = pt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_power")
    except Exception:
        df_feat_eng.drop([f"{column}_power"], axis=1, inplace=True)

    # BoxCoxTransformer
    try:
        bct = vt.BoxCoxTransformer(variables=[f"{column}_box_cox"])
        df_feat_eng = bct.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_box_cox")
    except Exception:
        df_feat_eng.drop([f"{column}_box_cox"], axis=1, inplace=True)

    # YeoJohnsonTransformer
    try:
        yjt = vt.YeoJohnsonTransformer(variables=[f"{column}_yeo_johnson"])
        df_feat_eng = yjt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_yeo_johnson")
    except Exception:
        df_feat_eng.drop([f"{column}_yeo_johnson"], axis=1, inplace=True)

    return df_feat_eng, list_methods_worked
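
To make the `outlier_winsorizer` option more concrete, here is a plain-pandas sketch of IQR capping with `fold=1.5` on both tails. It is an approximation for illustration, not the library's implementation; the helper name `cap_with_iqr` and the sample values are assumptions.

```python
import pandas as pd


def cap_with_iqr(values: pd.Series, fold: float = 1.5) -> pd.Series:
    """Cap values outside [Q1 - fold*IQR, Q3 + fold*IQR] (hypothetical helper)."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - fold * iqr
    upper = q3 + fold * iqr
    # clip replaces values beyond the bounds with the bounds themselves
    return values.clip(lower=lower, upper=upper)


# Illustrative values with one large outlier
lot_area = pd.Series([6600, 7015, 9490, 11694, 13360, 32668], name="LotArea")
capped = cap_with_iqr(lot_area)

print(capped.tolist())
```

Only the extreme value is pulled in towards the upper bound; values inside the bounds are left unchanged.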

### Feature Engineering Spreadsheet Summary

* Transformers that will be used:
    * Categorical Encoding
    * Numerical Transformation
    * Smart Correlation Selection

### Categorical Encoding

1. Define a variable containing the names of the categorical variables.

In [None]:
categorical_variables = list(TrainSet.select_dtypes(["object","category"]).columns)
categorical_variables


2. Create a DataFrame from a subset of the Training set using the defined variable.

In [None]:
df_categorical = TrainSet[categorical_variables].copy()
df_categorical.head()

3. Apply an ordinal encoding transformation to the categorical columns in the DataFrame:

In [None]:
df_engineering = FeatureEngineeringAnalysis(df=df_categorical, analysis_type="ordinal_encoder")

4. Apply the selected transformation to the Train and Test set

In [None]:
encoder = OrdinalEncoder(encoding_method="arbitrary", variables = categorical_variables)
TrainSet = encoder.fit_transform(TrainSet)
TestSet = encoder.transform(TestSet)

print("* Categorical encoding - ordinal transformation done!")
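
The fit-on-train, transform-on-test pattern used above can be sketched with plain pandas. This approximates arbitrary ordinal encoding by numbering categories in order of first appearance; the helper `fit_ordinal_mapping` and the sample categories are assumptions for illustration.

```python
import pandas as pd


def fit_ordinal_mapping(train_values: pd.Series) -> dict:
    """Number categories in order of first appearance (hypothetical helper)."""
    return {category: code for code, category in enumerate(train_values.unique())}


# Illustrative categories; "Mn" appears only in the test data
train = pd.Series(["Av", "No", "No", "Gd"], name="BsmtExposure")
test = pd.Series(["No", "Mn"], name="BsmtExposure")

# The mapping is learned from the training data only
mapping = fit_ordinal_mapping(train)

train_encoded = train.map(mapping)
# A category never seen during fitting has no code and becomes NaN here
test_encoded = test.map(mapping)

print(mapping)
print(test_encoded.tolist())
```

How a library encoder handles unseen or missing categories depends on its settings, so it is worth checking the encoded Test set for missing values after this step.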

### Numerical Transformation

1. Select the numerical variables:

In [None]:
numerical_variables = list(TrainSet.select_dtypes(["int64","float64"]).columns)
numerical_variables

2. Create a separate DataFrame:

In [None]:
df_numerical = TrainSet[numerical_variables].copy()
df_numerical.head(3)

3. Create engineered variables by applying the transformations:

In [None]:
df_numerical_engineered = FeatureEngineeringAnalysis(df=df_numerical, analysis_type="numerical")

### Feature Engineering Conclusion

- **Variable: `1stFlrSF`**
  - **Applied Transformations:** `log_e`, `log_10`, `reciprocal`, `power`, `box_cox`, `yeo_johnson`
  - **Conclusion:** None of the applied transformations significantly improved the boxplot distribution or QQ plot.

- **Variable: `2ndFlrSF`**
  - **Applied Transformations:** `power`, `yeo_johnson`
  - **Conclusion:** Similarly, the applied transformations did not effectively improve the distribution based on boxplot and QQ plot analysis.

**Overall:** The transformations applied to these variables were not effective in normalizing the distributions or reducing skewness.
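
A likely reason `2ndFlrSF` only lists `power` and `yeo_johnson` is that the column contains zeros: logarithm, reciprocal and Box-Cox are undefined or rejected for non-positive values, and the analysis function silently skips any transformer that raises. The sketch below illustrates this with NumPy and SciPy on made-up values; it uses `scipy.stats` directly rather than the feature-engine transformers, and a square root stands in for the power transformation.

```python
import numpy as np
from scipy import stats

# Illustrative values: many houses have no second floor, hence the zeros
second_floor = np.array([0.0, 0.0, 0.0, 620.0, 224.0, 854.0])

# Log and reciprocal are not defined at zero
has_non_positive = bool((second_floor <= 0).any())

# Box-Cox requires strictly positive data and raises otherwise
try:
    stats.boxcox(second_floor)
    box_cox_ok = True
except ValueError:
    box_cox_ok = False

# A square-root power transformation accepts zeros
power_transformed = np.sqrt(second_floor)

# Yeo-Johnson is defined for zero and negative values as well
yeo_johnson_transformed, fitted_lambda = stats.yeojohnson(second_floor)

print(has_non_positive, box_cox_ok)
print(np.isfinite(yeo_johnson_transformed).all())
```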


4. Apply the transformation to the Train and Test datasets.

In [None]:
# This code has been inspired by or adapted from GitHub repository by sashg91:
# Repository Link: https://github.com/SashG91/Heritage-Housing-Issues-PP5
# Specifically, the numerical transformation pipeline using LogTransformer, PowerTransformer, and YeoJohnsonTransformer.

from sklearn.pipeline import Pipeline
from feature_engine import transformation as vt

pipeline = Pipeline([
    ("NumericLogTransform", vt.LogTransformer(variables=["1stFlrSF", "LotArea", "GrLivArea"])),
    ("NumericPowerTransform", vt.PowerTransformer(variables=["MasVnrArea"])),
    ("NumericYeoJohnsonTransform", vt.YeoJohnsonTransformer(variables=["OpenPorchSF"]))
])

TrainSet = pipeline.fit_transform(TrainSet)
TestSet = pipeline.transform(TestSet)

print("* The numerical transformation has been completed!")
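
Because a log transformation needs strictly positive inputs, a quick check before fitting the pipeline can save a confusing error. This is a minimal sketch; `non_positive_columns` is a hypothetical helper and the frame holds only a few illustrative rows.

```python
import pandas as pd


def non_positive_columns(df: pd.DataFrame, columns: list) -> list:
    """Return the columns that contain zero or negative values (hypothetical helper)."""
    return [column for column in columns if (df[column] <= 0).any()]


# Illustrative rows only
df = pd.DataFrame({
    "1stFlrSF": [1828, 894, 964],
    "LotArea": [11694, 6600, 13360],
    "MasVnrArea": [452, 0, 0],
})

print(non_positive_columns(df, ["1stFlrSF", "LotArea", "MasVnrArea"]))
```

In this toy frame only `MasVnrArea` contains zeros, which is consistent with assigning it a power transformation rather than a logarithm.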


In [None]:
TrainSet.head(3)

### SmartCorrelatedSelection Variables

1.  We will remove the SalePrice column since our goal is to develop a model that predicts this value.

In [None]:
df_temp = TrainSet.drop(["SalePrice"],axis=1)
df_temp.head(3)

In [None]:
# Work on the copy without SalePrice so the target cannot take part in feature selection
df_engineering = df_temp.copy()
df_engineering.head(3)

2. Identify pairs of highly correlated features (Spearman, threshold 0.6) and drop the lower-variance feature of each pair.

In [None]:
import pandas as pd
import numpy as np

# Step 1: Calculate the correlation matrix
correlation_matrix = df_engineering.corr(method="spearman")

# Step 2: Identify pairs of highly correlated features (above a threshold of 0.6)
threshold = 0.6
correlated_features = set()

for i in range(len(correlation_matrix.columns)):
    for j in range(i):
        if abs(correlation_matrix.iloc[i, j]) > threshold:
            colname_i = correlation_matrix.columns[i]
            colname_j = correlation_matrix.columns[j]
            correlated_features.add((colname_i, colname_j))

# Step 3: Create groups of correlated features and decide which ones to drop
features_to_drop = set()
features_to_keep = set()

for feature_pair in correlated_features:
    # Select the feature with the higher variance to keep
    feature_i, feature_j = feature_pair
    var_i = df_engineering[feature_i].var()
    var_j = df_engineering[feature_j].var()
    
    if var_i >= var_j:
        features_to_drop.add(feature_j)
        features_to_keep.add(feature_i)
    else:
        features_to_drop.add(feature_i)
        features_to_keep.add(feature_j)

# Step 4: Drop the redundant features from the dataframe
df_engineering = df_engineering.drop(columns=list(features_to_drop))

# Step 5: Print the final columns after dropping
print("Remaining features after dropping correlated features:")
print(df_engineering.columns)


In [None]:
features_to_drop
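
The nested loop above can also be written with an upper-triangle mask, which avoids index arithmetic. The sketch below is one possible equivalent for finding the pairs only; `correlated_pairs` is a hypothetical helper and the data is made up.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df: pd.DataFrame, threshold: float = 0.6,
                     method: str = "spearman") -> pd.Series:
    """Return feature pairs whose absolute correlation exceeds the threshold."""
    corr = df.corr(method=method).abs()
    # Keep only entries above the diagonal so each pair is reported once
    upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper_mask).stack()
    # Comparisons with NaN are False, so masked entries never pass the filter
    return pairs[pairs > threshold]


# Illustrative data: "b" rises with "a", "c" is only weakly related to both
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 1, 4, 2, 3],
})

result = correlated_pairs(df)
print(result)
```

One thing to keep in mind with the pairwise keep/drop logic: a feature that wins one comparison and loses another ends up in both `features_to_keep` and `features_to_drop`, and it is dropped in the end.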

### Conclusion and Steps to Follow

#### Feature Engineering Transformers

- **Ordinal Categorical Encoding**:
  - Applied to variables: `BsmtExposure`, `BsmtFinType1`, `GarageFinish`, `KitchenQual` to convert them into numerical values for the model.

- **Numerical Transformation**:
  - The following transformations were applied to improve data distribution:
    - **Logarithmic Transformation (base e)**: Applied to `1stFlrSF`, `LotArea`, and `GrLivArea`.
    - **Box-Cox and Yeo-Johnson Transformations**: Box-Cox was considered for `GrLivArea`; Yeo-Johnson was applied to `OpenPorchSF`.
    - **Power Transformation**: Applied to `MasVnrArea`.
  - Note that `SalePrice` was excluded from transformations as it is our target variable.

#### Strongest Correlated Variables
- Based on the sale_price_study, the following features showed the strongest correlation with `SalePrice`:
  - `1stFlrSF`, `GarageArea`, `GrLivArea`, `OverallQual`, `TotalBsmtSF`, `YearBuilt`.

#### Smart Correlation Selection
- We used `SmartCorrelatedSelection` to eliminate features with high redundancy:
  - **Features Dropped**: `2ndFlrSF`, `GarageYrBlt`, `OverallQual`, `TotalBsmtSF`.
- **Correlation Methods and Selection**:
  - **Spearman**:
    - **Cardinality Selection**: Dropped `2ndFlrSF`, `GarageYrBlt`, `OverallQual`, and `TotalBsmtSF`.
    - **Variance Selection**: Dropped `1stFlrSF`, `GarageArea`, `GrLivArea`, `OverallQual`.
  - **Pearson**:
    - **Cardinality Selection**: Dropped `2ndFlrSF`, `GarageYrBlt`, `TotalBsmtSF`.
    - **Variance Selection**: Dropped `1stFlrSF`, `GarageArea`, `GrLivArea`.

#### Final Note
- After applying transformations and feature selection, we have prepared the dataset for model training. The final feature set consists of minimally correlated variables, numerically transformed to fit the requirements for building a robust machine learning model.
