# **Feature Engineering**

### Objectives

* Engineer features for Classification, Regression and Cluster models

### Inputs

* outputs/datasets/cleaned/test_set.csv
* outputs/datasets/cleaned/train_set.csv

### Outputs

* Generate a list of variables to engineer

### Additional Comments

* This file and its contents were inspired by the Churnometer Walkthrough Project 2. 
The code has been adapted and extended to analyze housing prices in Ames, Iowa, focusing on 
predictive analytics and insights related to property attributes and sales price.

### Change working directory

We need to change the working directory from its current folder to its parent folder.

* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.

* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

---

### Load Cleaned Data

Train Set

In [None]:
import pandas as pd
train_set_path = "outputs/datasets/cleaned/train_set.csv"
train_set = pd.read_csv(train_set_path)
train_set.head(3)

Test Set

In [None]:
test_set_path = "outputs/datasets/cleaned/test_set.csv"
test_set = pd.read_csv(test_set_path)
test_set.head(3)

### Data Exploration

We use the `ProfileReport` from `ydata_profiling` to perform an initial exploratory data analysis on the training dataset. This report provides insights into the dataset, such as missing values, distribution of variables, and possible correlations, helping us determine appropriate feature engineering transformations.


In [None]:
from ydata_profiling import ProfileReport
pandas_report = ProfileReport(df=train_set, minimal=True)
pandas_report.to_notebook_iframe()

### Feature Engineering

* In this section, we will analyze and transform the features in our dataset. We will utilize functions introduced in the feature-engine lesson.


In [None]:
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import warnings
from feature_engine import transformation as vt
from feature_engine.outliers import Winsorizer
from feature_engine.encoding import OrdinalEncoder
sns.set(style="whitegrid")
warnings.filterwarnings("ignore")


def FeatureEngineeringAnalysis(df, analysis_type=None):
    """
    - used for quick feature engineering on numerical and categorical variables
    to decide which transformation can better transform the distribution shape
    - Once transformed, use a reporting tool, like ydata-profiling, to evaluate distributions
    """
    check_missing_values(df)
    allowed_types = ["numerical", "ordinal_encoder", "outlier_winsorizer"]
    check_user_entry_on_analysis_type(analysis_type, allowed_types)
    list_column_transformers = define_list_column_transformers(analysis_type)

    # Loop in each variable and engineer the data according to the analysis type
    df_feat_eng = pd.DataFrame([])
    for column in df.columns:
        # create additional columns (column_method) to apply the methods
        df_feat_eng = pd.concat([df_feat_eng, df[column]], axis=1)
        for method in list_column_transformers:
            df_feat_eng[f"{column}_{method}"] = df[column]

        # Apply transformers in respective column_transformers
        df_feat_eng, list_applied_transformers = apply_transformers(
            analysis_type, df_feat_eng, column)

        # For each variable, assess how the transformations perform
        transformer_evaluation(
            column, list_applied_transformers, analysis_type, df_feat_eng)

    return df_feat_eng


def check_user_entry_on_analysis_type(analysis_type, allowed_types):
    """ Check analysis type """
    if analysis_type is None:
        raise SystemExit(
            f"You should pass analysis_type parameter as one of the following options: {allowed_types}")
    if analysis_type not in allowed_types:
        raise SystemExit(
            f"analysis_type argument should be one of these options: {allowed_types}")


def check_missing_values(df):
    """ Stop early if the dataset still contains missing values """
    if df.isna().sum().sum() != 0:
        raise SystemExit(
            "There is a missing value in your dataset. Please handle that before getting into feature engineering.")


def define_list_column_transformers(analysis_type):
    """ Set suffix columns according to analysis_type"""
    if analysis_type == "numerical":
        list_column_transformers = [
            "log_e", "log_10", "reciprocal", "power", "box_cox", "yeo_johnson"]

    elif analysis_type == "ordinal_encoder":
        list_column_transformers = ["ordinal_encoder"]

    elif analysis_type == "outlier_winsorizer":
        list_column_transformers = ["iqr"]

    return list_column_transformers


def apply_transformers(analysis_type, df_feat_eng, column):
    for col in df_feat_eng.select_dtypes(include="category").columns:
        df_feat_eng[col] = df_feat_eng[col].astype("object")

    if analysis_type == "numerical":
        df_feat_eng, list_applied_transformers = FeatEngineering_Numerical(
            df_feat_eng, column)

    elif analysis_type == "outlier_winsorizer":
        df_feat_eng, list_applied_transformers = FeatEngineering_OutlierWinsorizer(
            df_feat_eng, column)

    elif analysis_type == "ordinal_encoder":
        df_feat_eng, list_applied_transformers = FeatEngineering_CategoricalEncoder(
            df_feat_eng, column)

    return df_feat_eng, list_applied_transformers


def transformer_evaluation(column, list_applied_transformers, analysis_type, df_feat_eng):
    # For each variable, assess how the transformations perform
    print(f"* Variable Analyzed: {column}")
    print(f"* Applied transformation: {list_applied_transformers} \n")
    for col in [column] + list_applied_transformers:

        if analysis_type != "ordinal_encoder":
            DiagnosticPlots_Numerical(df_feat_eng, col)

        else:
            if col == column:
                DiagnosticPlots_Categories(df_feat_eng, col)
            else:
                DiagnosticPlots_Numerical(df_feat_eng, col)

        print("\n")


def DiagnosticPlots_Categories(df_feat_eng, col):
    plt.figure(figsize=(4, 3))
    # A single bar colour is passed via `color`; `palette` without `hue` is deprecated in recent seaborn
    sns.countplot(data=df_feat_eng, x=col, color="#432371",
                  order=df_feat_eng[col].value_counts().index)
    plt.xticks(rotation=90)
    plt.suptitle(f"{col}", fontsize=30, y=1.05)
    plt.show()
    print("\n")


def DiagnosticPlots_Numerical(df, variable):
    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    sns.histplot(data=df, x=variable, kde=True, element="step", ax=axes[0])
    stats.probplot(df[variable], dist="norm", plot=axes[1])
    sns.boxplot(x=df[variable], ax=axes[2])

    axes[0].set_title("Histogram")
    axes[1].set_title("QQ Plot")
    axes[2].set_title("Boxplot")
    fig.suptitle(f"{variable}", fontsize=30, y=1.05)
    plt.tight_layout()
    plt.show()


def FeatEngineering_CategoricalEncoder(df_feat_eng, column):
    list_methods_worked = []
    try:
        encoder = OrdinalEncoder(encoding_method="arbitrary", variables=[
                                 f"{column}_ordinal_encoder"])
        df_feat_eng = encoder.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_ordinal_encoder")

    except Exception:
        df_feat_eng.drop([f"{column}_ordinal_encoder"], axis=1, inplace=True)

    return df_feat_eng, list_methods_worked


def FeatEngineering_OutlierWinsorizer(df_feat_eng, column):
    list_methods_worked = []

    # Winsorizer iqr
    try:
        disc = Winsorizer(
            capping_method="iqr", tail="both", fold=1.5, variables=[f"{column}_iqr"])
        df_feat_eng = disc.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_iqr")
    except Exception:
        df_feat_eng.drop([f"{column}_iqr"], axis=1, inplace=True)

    return df_feat_eng, list_methods_worked


def FeatEngineering_Numerical(df_feat_eng, column):
    list_methods_worked = []

    # LogTransformer base e
    try:
        lt = vt.LogTransformer(variables=[f"{column}_log_e"])
        df_feat_eng = lt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_log_e")
    except Exception:
        df_feat_eng.drop([f"{column}_log_e"], axis=1, inplace=True)

    # LogTransformer base 10
    try:
        lt = vt.LogTransformer(variables=[f"{column}_log_10"], base="10")
        df_feat_eng = lt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_log_10")
    except Exception:
        df_feat_eng.drop([f"{column}_log_10"], axis=1, inplace=True)

    # ReciprocalTransformer
    try:
        rt = vt.ReciprocalTransformer(variables=[f"{column}_reciprocal"])
        df_feat_eng = rt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_reciprocal")
    except Exception:
        df_feat_eng.drop([f"{column}_reciprocal"], axis=1, inplace=True)

    # PowerTransformer
    try:
        pt = vt.PowerTransformer(variables=[f"{column}_power"])
        df_feat_eng = pt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_power")
    except Exception:
        df_feat_eng.drop([f"{column}_power"], axis=1, inplace=True)

    # BoxCoxTransformer
    try:
        bct = vt.BoxCoxTransformer(variables=[f"{column}_box_cox"])
        df_feat_eng = bct.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_box_cox")
    except Exception:
        df_feat_eng.drop([f"{column}_box_cox"], axis=1, inplace=True)

    # YeoJohnsonTransformer
    try:
        yjt = vt.YeoJohnsonTransformer(variables=[f"{column}_yeo_johnson"])
        df_feat_eng = yjt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_yeo_johnson")
    except Exception:
        df_feat_eng.drop([f"{column}_yeo_johnson"], axis=1, inplace=True)

    return df_feat_eng, list_methods_worked

### Feature Engineering Spreadsheet Summary

* Transformers that will be used:
    * Categorical Encoding
    * Numerical Transformation
    * Smart Correlation Selection

### Categorical Encoding

1. Define a variable containing the names of the categorical variables.

In [None]:
categorical_variables = list(train_set.select_dtypes(["object","category"]).columns)
categorical_variables

2. Create a DataFrame from a subset of the Training set using the defined variable.

In [None]:
df_categorical = train_set[categorical_variables].copy()
df_categorical.head(3)

Replace missing values with the most frequent value (mode) of each column

In [None]:
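for column in df_categorical.columns:
    mode_value = df_categorical[column].mode()[0]
    # Assign the result back: in-place fillna on a column selection is unreliable in recent pandas
    df_categorical[column] = df_categorical[column].fillna(mode_value)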


3. Apply an ordinal encoding transformation to the categorical columns in the DataFrame:

In [None]:
df_categorical_engineered = FeatureEngineeringAnalysis(df=df_categorical, analysis_type='ordinal_encoder')
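
As a rough illustration of what an arbitrary ordinal encoding does, the sketch below uses plain pandas on a small made-up `KitchenQual` column. It only approximates the behaviour of `OrdinalEncoder(encoding_method="arbitrary")`: each category receives an integer in order of first appearance. The sample values and the variable names `kitchen_qual`, `mapping` and `encoded` are assumptions for illustration only.

```python
import pandas as pd

# Illustrative values only; not taken from the real training set
kitchen_qual = pd.Series(["Gd", "TA", "Gd", "Ex", "TA"], name="KitchenQual")

# factorize assigns integer codes in order of first appearance
codes, categories = pd.factorize(kitchen_qual)

# Category -> integer mapping, similar in spirit to an arbitrary ordinal encoder
mapping = {category: code for code, category in enumerate(categories)}
encoded = pd.Series(codes, index=kitchen_qual.index,
                    name="KitchenQual_ordinal_encoder")

print(mapping)
print(encoded.tolist())
```

The integers carry no quality ranking; they only make the column numeric so that a model can consume it.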

### Numerical Transformation

1. Select the names of the numerical variables:

In [None]:
numerical_variables = list(train_set.select_dtypes(["int64","float64"]).columns)
numerical_variables

2. Create a separate DataFrame:

In [None]:
df_numerical = train_set[numerical_variables].copy()
df_numerical.head(3)

3. Create engineered variables by applying the transformations:

In [None]:
df_numerical_engineered = FeatureEngineeringAnalysis(df=df_numerical, analysis_type="numerical")
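
The diagnostic plots can be complemented with a numeric check. The sketch below is a minimal example on synthetic, right-skewed data (not the Ames dataset) that compares skewness before and after a natural log and a square-root transformation, which correspond roughly to the `log_e` and `power` options. The series `area` and the dictionary `candidates` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data standing in for an area-like variable
rng = np.random.default_rng(0)
area = pd.Series(rng.lognormal(mean=7.0, sigma=0.6, size=2000), name="area")

# Candidate transformations, analogous to the log_e and power (exponent 0.5) options
candidates = {
    "raw": area,
    "log_e": np.log(area),
    "power_0.5": np.sqrt(area),
}

# Skewness close to 0 suggests a more symmetric distribution
skewness = pd.Series({name: values.skew() for name, values in candidates.items()})
print(skewness.round(2))
```

A skewness value nearer to zero is only one indicator; the QQ plot and boxplot remain useful for judging tails and outliers.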

### Feature Engineering Conclusion

- **Variable: `1stFlrSF`**
  - **Applied Transformations:** `log_e`, `log_10`, `reciprocal`, `power`, `box_cox`, `yeo_johnson`
  - **Conclusion:** None of the applied transformations significantly improved the boxplot distribution or QQ plot.

- **Variable: `2ndFlrSF`**
  - **Applied Transformations:** `power`, `yeo_johnson`
  - **Conclusion:** Similarly, the applied transformations did not effectively improve the distribution based on boxplot and QQ plot analysis.

**Overall:** The transformations applied to these variables were not effective in normalizing the distributions or reducing skewness.


4. Apply the transformation to the Train and Test datasets.

In [None]:
from sklearn.pipeline import Pipeline
from feature_engine import transformation as tf

data_pipeline = Pipeline([
    ("LogTransform", tf.LogTransformer(variables=["1stFlrSF", "LotArea", "GrLivArea"])),
    ("PowerTransform", tf.PowerTransformer(variables=["MasVnrArea"])),
    ("YeoJohnsonTransform", tf.YeoJohnsonTransformer(variables=["OpenPorchSF"]))
])


train_set = data_pipeline.fit_transform(train_set)

test_set = data_pipeline.transform(test_set)

train_set.head(3)
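
The reason for calling `fit_transform` on the train set and only `transform` on the test set is that any learned parameter should come from the training data alone. The sketch below illustrates this idea with `scipy.stats.yeojohnson` on synthetic data; it is not the pipeline above, and the arrays `train_porch` and `test_porch` are made-up stand-ins.

```python
import numpy as np
from scipy import stats

# Synthetic, right-skewed stand-ins for a train and a test column
rng = np.random.default_rng(42)
train_porch = rng.exponential(scale=50.0, size=500)
test_porch = rng.exponential(scale=50.0, size=100)

# "fit": estimate the Yeo-Johnson lambda on the training data only
train_transformed, fitted_lambda = stats.yeojohnson(train_porch)

# "transform": reuse the training lambda on the test data, without re-estimating it
test_transformed = stats.yeojohnson(test_porch, lmbda=fitted_lambda)

print(f"lambda learned on train: {fitted_lambda:.3f}")
```

Re-estimating the parameter on the test set would leak information from data that is meant to stay unseen.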

### SmartCorrelatedSelection Variables

1.  We will remove the SalePrice column since our goal is to develop a model that predicts this value.

In [None]:
df_temp = train_set.drop(["SalePrice"],axis=1)
df_temp.head(3)

2. Create a separate DataFrame

In [None]:
df_engineering = df_temp.copy()
df_engineering.head(3)

3. This code identifies and removes highly correlated numerical columns based on a Spearman correlation threshold of 0.60.

In [None]:
import pandas as pd
import numpy as np

df_engineering = df_temp.copy()

numerical_df = df_engineering.select_dtypes(include=['number'])

corr_matrix = numerical_df.corr(method='spearman').abs()

upper_tri = corr_matrix.where(
    np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
)

to_drop = [column for column in upper_tri.columns if any(upper_tri[column] > 0.60)]

df_engineering = df_engineering.drop(columns=to_drop)

to_drop
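
To make the upper-triangle logic easier to follow, here is a minimal sketch on a toy DataFrame with hypothetical columns `a`, `b` and `c`. Column `b` is a monotonic function of `a`, so their Spearman correlation is 1, while `c` is only weakly related to both.

```python
import numpy as np
import pandas as pd

# Toy data: b is monotonic in a, c is only weakly related to a and b
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [1, 8, 27, 64, 125, 216],
    "c": [3, 1, 6, 2, 5, 4],
})

corr = toy.corr(method="spearman").abs()

# Keep only the entries above the diagonal so each pair is checked once
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

# A column is dropped if it correlates above the threshold with any earlier column
to_drop_toy = [col for col in upper.columns if (upper[col] > 0.60).any()]
print(to_drop_toy)
```

Because only the upper triangle is inspected, the column that appears later in each correlated pair is the one removed, so the column order influences which features survive.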


### Conclusion and Steps to Follow

#### Feature Engineering Transformers

- **Ordinal Categorical Encoding**:
  - Applied to variables: `BsmtExposure`, `BsmtFinType1`, `GarageFinish`, `KitchenQual` to convert them into numerical values for the model.

#### Strongest Correlated Variables
- Based on the sale_price_study, the following features showed the strongest correlation with `SalePrice`:
  - `1stFlrSF`, `GarageArea`, `GrLivArea`, `OverallQual`, `YearBuilt`.

#### Manual Correlation Selection
- We manually identified and removed highly correlated features to reduce redundancy in the dataset:
  - **Features Dropped**: `2ndFlrSF`, `GarageYrBlt`, `OverallQual`, `TotalBsmtSF`.
- **Correlation Method and Threshold**:
  - We calculated the **Spearman correlation** matrix and set a threshold of **0.60** to identify highly correlated pairs.
  - For each pair of features with correlation above the threshold, the feature that appears later in the column order was removed to reduce multicollinearity.
  - This process was done without automated selection methods, allowing precise control over which features were retained in the final dataset.

#### Final Note
- After applying transformations and feature selection, we have prepared the dataset for model training. The final feature set consists of minimally correlated variables, numerically transformed to fit the requirements for building a robust machine learning model.
