# Notebook 08 - Feature Engineering

## Objectives

Engineer Features for:
* Classification
* Regression
* Clustering

## Inputs
* outputs/datasets/cleaned/train.parquet.gzip

## Outcome

All engineered features and the transformations selected for them

## Change working directory
In this section we get the location of the current directory and move one level up, to the parent folder, so that the notebook works from the project folder.

* We access the current directory with os.getcwd()

In [None]:
import os

current_dir = os.getcwd()
current_dir

In [None]:
os.chdir(os.path.dirname(current_dir))
print('Current working directory is', os.getcwd())

## Loading Dataset

In [None]:
import pandas as pd

df = pd.read_parquet('outputs/datasets/cleaned/train.parquet.gzip')
# Drop the leftover index column if it is present
df.drop(columns=['Unnamed: 0'], inplace=True, errors='ignore')
df.head()

## Data Exploration

Hypothesis 2 also failed. It is possible that features interact with each other and produce useful new features; at the same time, we can extract useful information from existing features.
1. Changing the encoding (create a dictionary for the ordinal encoder):
    * When we encode Basement Exposure and Finish Type, None becomes 0, which is fine, as there is no basement.
    * When we encode Garage Finish, the same applies: None becomes 0, as there is no garage.
    * Kitchen Quality - Po (Poor) becomes 0, which is wrong, because a poor kitchen is still a kitchen. Whether the code is zero, positive or negative matters once it interacts with other features.
2. Create new mathematical sub_features:
    * Basement:
        * Basement Exposure mathematical manipulations with all Basement Areas
        * Basement Finish Type manipulations with all Basement Areas
    * Garage:
        * Garage Finish mathematical manipulations with Garage Area
    * Building:
        * Overall Cond mathematical manipulations with building areas
        * Overall Quality mathematical manipulations with building areas
3. Extract information and create new sub_features (we know buildings dates are up to 2010):
    * Garage Age = 2010 - Garage Year Built
    * Building Age = 2010 - Year Built
    * Remod Age = 2010 - Remodel Year
    * Remod Age Test = If the house was built and remodeled in the same year, this value will be 0, else Remod Age
4. Checking if a house feature exists (maybe garage, porch or deck size does not matter; what matters is that it is there):
    * Has 2nd floor - If area of 2nd floor > 0, we will set to True, else False
    * Has Basement - If building has basement = True, else False
    * Has Garage - If building has Garage = True, else False
    * Has Masonry Veneer - If building has masonry veneer = True, else False
    * Has Enclosed Porch - If building has Enclosed Porch = True, else False
    * Has Open Porch - If building has Open Porch = True, else False
    * Has Any Porch - If building has any type of porch = True, else False
    * Has Wooden Deck - If building Has wooden deck = True, else False

After the new features are created, check how they correlate with the existing features and with each other.
* All new features will have prefix NF_
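
A shared prefix makes the follow-up correlation check easy to automate. The snippet below is a minimal sketch on made-up data (the `toy` frame and its values are assumptions, not the real training set): it collects every `NF_` column and correlates it with `SalePrice`.

```python
import pandas as pd

# Toy frame standing in for the real training data (values are made up).
toy = pd.DataFrame({
    "YearBuilt": [1960, 1985, 2000, 2008],
    "SalePrice": [100_000, 150_000, 210_000, 260_000],
})
toy["NF_Age_Build"] = 2010 - toy["YearBuilt"]

# Every engineered column can be picked up through the shared prefix.
new_features = [col for col in toy.columns if col.startswith("NF_")]

# Correlation of each engineered column with the target.
nf_target_corr = toy[new_features].corrwith(toy["SalePrice"])
print(nf_target_corr)
```

In this toy data, newer houses sell for more, so the age feature correlates negatively with the price.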

## Feature Engineering

### Categorical Features Encoding

1. We will define the category order for the encoder, so that encoded categorical features receive correct, or at least logical, numbers
2. We will add one more encoder, OneHotEncoder, so we can compare how each encoding increases or decreases model performance
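
The difference between the two encodings can be sketched on a single toy column. The data below is made up; only the `KitchenQual` category order is taken from this notebook. The one-hot result is converted with `.toarray()` because `OneHotEncoder` returns a sparse matrix by default.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy column; the real data has more categorical features.
toy = pd.DataFrame({"KitchenQual": ["Po", "TA", "Ex", "Gd"]})

# Explicit order, with 'None' first so that 'Po' is encoded as 1 rather than 0.
kitchen_order = ["None", "Po", "Fa", "TA", "Gd", "Ex"]

ordinal = OrdinalEncoder(categories=[kitchen_order])
ordinal_codes = ordinal.fit_transform(toy[["KitchenQual"]]).ravel()
print(ordinal_codes)  # Po -> 1, TA -> 3, Ex -> 5, Gd -> 4

# One-hot alternative: one binary column per category seen in the data.
one_hot = OneHotEncoder()
one_hot_matrix = one_hot.fit_transform(toy[["KitchenQual"]]).toarray()
print(one_hot.categories_[0], one_hot_matrix.shape)
```

The ordinal codes keep the quality ranking in one column, while one-hot encoding drops the ranking and adds a column per category.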

In [None]:
from sklearn.preprocessing import OrdinalEncoder

# Category order per feature, from lowest to highest.
# 'None' is listed first for KitchenQual as well; otherwise 'Po' would be encoded as 0.
order = {
    'BsmtExposure': ['None', 'No', 'Mn', 'Av', 'Gd'],
    'BsmtFinType1': ['None', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'],
    'GarageFinish': ['None', 'Unf', 'RFn', 'Fin'],
    'KitchenQual': ['None', 'Po', 'Fa', 'TA', 'Gd', 'Ex']
}

# Encode exactly the features listed in `order`, in that order, so that each
# column is matched with its own category list regardless of column order in df
categorical_features = list(order.keys())

# Initialize the OrdinalEncoder with the specified order
encoder = OrdinalEncoder(categories=[order[feature] for feature in categorical_features])

# Fit and transform the data
df[categorical_features] = encoder.fit_transform(df[categorical_features])

### Basement Features

First we will create new sub features using RelativeFeatures
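
To make clear what kind of columns this step produces, here is a pandas-only sketch of the same idea on made-up data. It is not the `RelativeFeatures` implementation; the `<variable>_<func>_<reference>` naming is assumed here because it matches the feature names used later in this notebook.

```python
import pandas as pd

# Toy data: one area column and one already-encoded ordinal column.
toy = pd.DataFrame({
    "TotalBsmtSF": [0, 800, 1200],
    "BsmtExposure": [0, 1, 4],
})

variables = ["TotalBsmtSF"]
references = ["BsmtExposure"]
operations = {
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
    "add": lambda a, b: a + b,
}

# Combine every variable with every reference using each operation.
for var in variables:
    for ref in references:
        for name, op in operations.items():
            toy[f"{var}_{name}_{ref}"] = op(toy[var], toy[ref])

print(toy.columns.tolist())
```

With 3 variables, 2 references and 3 operations, the real call below adds 18 candidate columns, which is why a selection step follows.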

In [None]:
from feature_engine.creation import RelativeFeatures

basement_features = ['BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtUnfSF', 'TotalBsmtSF']
transformer = RelativeFeatures(
    variables=['BsmtFinSF1', 'BsmtUnfSF', 'TotalBsmtSF'],
    reference=['BsmtExposure', 'BsmtFinType1'],
    func=["sub", "mul", "add"],  # We will try to subtract, multiply and add - sum features
)
df_basement = transformer.fit_transform(df[basement_features])
df_basement.head()

Now, using SmartCorrelatedSelection, we identify sets of correlated sub_features, so we do not need to work with all of them
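
The idea behind this step can be sketched in plain pandas. The helper `correlated_groups` below is hypothetical and uses a simplified greedy grouping on made-up data; it only approximates what the library does and should not be read as its actual algorithm.

```python
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "a_times_10": [10.0, 20.0, 30.0, 40.0, 50.0],  # perfectly correlated with "a"
    "b": [2.0, 1.0, 2.0, 1.0, 2.0],                # uncorrelated with "a"
})

def correlated_groups(frame, threshold=0.8):
    """Greedily group columns whose absolute Pearson correlation exceeds the threshold."""
    corr = frame.corr(method="pearson").abs()
    groups, assigned = [], set()
    for col in corr.columns:
        if col in assigned:
            continue
        partners = [
            other for other in corr.columns
            if other != col and other not in assigned and corr.loc[col, other] > threshold
        ]
        if partners:
            groups.append({col, *partners})
            assigned.update({col, *partners})
    return groups

groups = correlated_groups(toy)
# From each group keep the column with the largest variance.
kept = [max(group, key=lambda c: toy[c].var()) for group in groups]
print(groups, kept)
```

Selecting by variance favors columns on a larger numeric scale, so products of areas tend to win over plain ordinal codes.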

In [None]:
from feature_engine.selection import SmartCorrelatedSelection

tr = SmartCorrelatedSelection(
    variables=None,
    method="pearson",
    threshold=0.8,
    missing_values="raise",
    selection_method="variance",
    estimator=None,
)

tr.fit_transform(df_basement)

basement_feature_sets = tr.correlated_feature_sets_
basement_feature_sets

We can now see the correlated sets; from each set we will select only the feature we need

In [None]:
selected_features = []

for feature_set in tr.correlated_feature_sets_:
    # Calculate variances within each set
    variances = {feature: df_basement[feature].var() for feature in feature_set}
    # Select the feature with the highest variance
    best_feature = max(variances, key=variances.get)
    selected_features.append(best_feature)

print("Selected features:", selected_features)


We can see that the best features and their combinations are:
1. TotalBsmtSF * BsmtExposure => Yes, it looks good and logical
2. TotalBsmtSF * BsmtFinType1 => Also logical
3. BsmtFinSF1 * BsmtFinType1 => Very logical
4. BsmtUnfSF - BsmtFinType1 => Doubtful, it is the unfinished area minus the finish type
5. TotalBsmtSF + BsmtFinType1 => Also not very logical

We will keep the first three and create the new sub_features like this (the prefix NF_ is added to all new sub_features, which helps to identify them):
```python
df['NF_TotalBsmtSF_mul_BsmtExposure'] = df['TotalBsmtSF'] * df['BsmtExposure']
df['NF_TotalBsmtSF_mul_BsmtFinType1'] = df['TotalBsmtSF'] * df['BsmtFinType1']
df['NF_BsmtFinSF1_mul_BsmtFinType1'] = df['BsmtFinType1'] * df['BsmtFinSF1']
```

In [None]:
df['NF_TotalBsmtSF_mul_BsmtExposure'] = df['TotalBsmtSF'] * df['BsmtExposure']
df['NF_TotalBsmtSF_mul_BsmtFinType1'] = df['TotalBsmtSF'] * df['BsmtFinType1']
df['NF_BsmtFinSF1_mul_BsmtFinType1'] = df['BsmtFinType1'] * df['BsmtFinSF1']

### Garage Features

In [None]:
df['NF_GarageFinish_mul_GarageArea'] = df['GarageFinish'] * df['GarageArea']

Code to add to the creation of new sub_features:
```python
df['NF_GarageFinish_mul_GarageArea'] = df['GarageFinish'] * df['GarageArea']
```

### Building sub_features:

Now this is the hardest part. We have 2 rating features for the building:
* Overall Quality - rates the overall material and finish of the house
* Overall Condition - rates the overall condition of the house

Logically, they should apply to the whole building, so we could combine these values (after ordinal encoding with the dictionary) with every building feature and relate them to Sale Price. But we cannot, as they do not apply to:
* Lot Area
* Lot Frontage
* Porches, etc.

Based on observation of the whole dataset, they *should* apply just to the living areas. We can do it in 2 ways:
* Sum all living areas of the building and apply mathematical manipulations with those 2 ratings
* Apply mathematical manipulations of each rating to each area of the building: ground level, 1st and 2nd floors individually

We will do both and use smart correlation to select only the best ones. We do not want to add too many new sub_features, as they can add noise to the ML model.

In [None]:
df['NF_TotalLivingArea'] = df['GrLivArea'] + df['1stFlrSF'] + df['2ndFlrSF']


In [None]:
from feature_engine.creation import RelativeFeatures


living_features = ['GrLivArea', '1stFlrSF', '2ndFlrSF', 'NF_TotalLivingArea', 'OverallCond', 'OverallQual']
transformer = RelativeFeatures(
    variables=['GrLivArea', '1stFlrSF', '2ndFlrSF', 'NF_TotalLivingArea'],
    reference=['OverallCond', 'OverallQual'],
    func=["mul"]
)
df_living_area = transformer.fit_transform(df[living_features])
df_living_area.head()

Let's check the correlation between all of them and select the best ones

In [None]:
from feature_engine.selection import SmartCorrelatedSelection

tr = SmartCorrelatedSelection(
    variables=None,
    method="pearson",
    threshold=0.9,
    missing_values="raise",
    selection_method="variance",
    estimator=None,
)

tr.fit_transform(df_living_area)

living_area_sets = tr.correlated_feature_sets_
living_area_sets

In [None]:
selected_features = []

for feature_set in tr.correlated_feature_sets_:
    # Calculate variances within each set
    variances = {feature: df_living_area[feature].var() for feature in feature_set}
    # Select the feature with the highest variance
    best_feature = max(variances, key=variances.get)
    selected_features.append(best_feature)

print("Selected features:", selected_features)


We can see that these features are selected from the sets:
* 'NF_TotalLivingArea_mul_OverallQual' - logical and agreed, we will keep this sub_feature
* '1stFlrSF_mul_OverallQual' - also logical, we will keep it
* '2ndFlrSF_mul_OverallQual' - also logical
* 'NF_TotalLivingArea_mul_OverallCond' - also logical

We will keep all these new sub_features and add the code for creating them:
```python
df['NF_TotalLivingArea'] = df['GrLivArea'] + df['1stFlrSF'] + df['2ndFlrSF']
df['NF_TotalLivingArea_mul_OverallQual'] = df['NF_TotalLivingArea'] * df['OverallQual']
df['NF_TotalLivingArea_mul_OverallCond'] = df['NF_TotalLivingArea'] * df['OverallCond']
df['NF_1stFlrSF_mul_OverallQual'] = df['1stFlrSF'] * df['OverallQual']
df['NF_2ndFlrSF_mul_OverallQual'] = df['2ndFlrSF'] * df['OverallQual']
```

### Extraction of information from Features and creating new ones

In [None]:
df['NF_Age_Garage'] = 2010 - df['GarageYrBlt']
df['NF_Age_Build'] = 2010 - df['YearBuilt']
df['NF_Age_Remod'] = 2010 - df['YearRemodAdd']
df['NF_Remod_TEST'] = df.apply(lambda row: 0 if row['NF_Age_Build'] == row['NF_Age_Remod'] else row['NF_Age_Remod'], axis=1)

Code to add to the creation of new sub_features:
```python
df['NF_Age_Garage'] = 2010 - df['GarageYrBlt']
df['NF_Age_Build'] = 2010 - df['YearBuilt']
df['NF_Age_Remod'] = 2010 - df['YearRemodAdd']
df['NF_Remod_TEST'] = df.apply(lambda row: 0 if row['NF_Age_Build'] == row['NF_Age_Remod'] else row['NF_Age_Remod'], axis=1)
```
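
The row-wise `apply` used for `NF_Remod_TEST` can also be written as a vectorized operation. The sketch below uses `Series.where` on made-up years and is meant as an equivalent alternative, not as a replacement of the code above.

```python
import pandas as pd

toy = pd.DataFrame({
    "YearBuilt": [1990, 1970, 2005],
    "YearRemodAdd": [1990, 2000, 2006],
})
toy["NF_Age_Build"] = 2010 - toy["YearBuilt"]
toy["NF_Age_Remod"] = 2010 - toy["YearRemodAdd"]

# Keep the remodel age where it differs from the building age, otherwise use 0.
toy["NF_Remod_TEST"] = toy["NF_Age_Remod"].where(
    toy["NF_Age_Build"] != toy["NF_Age_Remod"], other=0
)
print(toy["NF_Remod_TEST"].tolist())  # [0, 10, 4]
```

On large frames the vectorized form usually runs much faster than `df.apply(..., axis=1)`.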

### Checking Features if they exist and creating new ones

After each feature is created, we will save it as INT, which is easier for Machine Learning
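
The flags in the next cell are set whenever an area is not equal to zero. One caveat is worth checking: if a column still contained missing values, that rule would mark them as "present". The sketch below shows the difference on made-up data; whether the cleaned dataset has any missing values left is not confirmed here.

```python
import numpy as np
import pandas as pd

# Toy column with a missing value (made-up data).
toy = pd.DataFrame({"GarageArea": [0.0, 480.0, np.nan]})

# "Not equal to zero" treats the missing value as present ...
flag_not_zero = (toy["GarageArea"] != 0).astype(int)
# ... while "greater than zero" treats it as absent.
flag_positive = (toy["GarageArea"] > 0).astype(int)

print(flag_not_zero.tolist(), flag_positive.tolist())  # [0, 1, 1] [0, 1, 0]
```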

In [None]:
# A feature "exists" when its area is not equal to zero; stored as 0/1 integers
df['NF_Has_2nd_floor'] = (df['2ndFlrSF'] != 0).astype(int)
df['NF_Has_basement'] = (df['TotalBsmtSF'] != 0).astype(int)
df['NF_Has_garage'] = (df['GarageArea'] != 0).astype(int)
df['NF_Has_Masonry_Veneer'] = (df['MasVnrArea'] != 0).astype(int)
df['NF_Has_Enclosed_Porch'] = (df['EnclosedPorch'] != 0).astype(int)
df['NF_Has_Open_Porch'] = (df['OpenPorchSF'] != 0).astype(int)
df['NF_Has_ANY_Porch'] = (df['NF_Has_Enclosed_Porch'] | df['NF_Has_Open_Porch']).astype(int)
df['NF_Has_Wooden_Deck'] = (df['WoodDeckSF'] != 0).astype(int)

### Code to create new sub_features

In [None]:
df['NF_TotalBsmtSF_mul_BsmtExposure'] = df['TotalBsmtSF'] * df['BsmtExposure']
df['NF_TotalBsmtSF_mul_BsmtFinType1'] = df['TotalBsmtSF'] * df['BsmtFinType1']
df['NF_BsmtFinSF1_mul_BsmtFinType1'] = df['BsmtFinType1'] * df['BsmtFinSF1']
df['NF_GarageFinish_mul_GarageArea'] = df['GarageFinish'] * df['GarageArea']
df['NF_TotalLivingArea'] = df['GrLivArea'] + df['1stFlrSF'] + df['2ndFlrSF']
df['NF_TotalLivingArea_mul_OverallQual'] = df['NF_TotalLivingArea'] * df['OverallQual']
df['NF_TotalLivingArea_mul_OverallCond'] = df['NF_TotalLivingArea'] * df['OverallCond']
df['NF_1stFlrSF_mul_OverallQual'] = df['1stFlrSF'] * df['OverallQual']
df['NF_2ndFlrSF_mul_OverallQual'] = df['2ndFlrSF'] * df['OverallQual']
df['NF_Age_Garage'] = 2010 - df['GarageYrBlt']
df['NF_Age_Build'] = 2010 - df['YearBuilt']
df['NF_Age_Remod'] = 2010 - df['YearRemodAdd']
df['NF_Remod_TEST'] = df.apply(lambda row: 0 if row['NF_Age_Build'] == row['NF_Age_Remod'] else row['NF_Age_Remod'], axis=1)
df['NF_Has_2nd_floor'] = (df['2ndFlrSF'] != 0).astype(int)
df['NF_Has_basement'] = (df['TotalBsmtSF'] != 0).astype(int)
df['NF_Has_garage'] = (df['GarageArea'] != 0).astype(int)
df['NF_Has_Masonry_Veneer'] = (df['MasVnrArea'] != 0).astype(int)
df['NF_Has_Enclosed_Porch'] = (df['EnclosedPorch'] != 0).astype(int)
df['NF_Has_Open_Porch'] = (df['OpenPorchSF'] != 0).astype(int)
df['NF_Has_ANY_Porch'] = (df['NF_Has_Enclosed_Porch'] | df['NF_Has_Open_Porch']).astype(int)
df['NF_Has_Wooden_Deck'] = (df['WoodDeckSF'] != 0).astype(int)

### Numerical Transformations

Check which numerical transformations are needed for each feature, including the new sub_features.
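
As a quick illustration of what these transformations do to a skewed variable, the sketch below applies three of them with plain NumPy to made-up values and compares the skewness. The square root is used as the power example, which is an assumption about the exponent. Log and reciprocal are undefined for zeros, which is presumably why the function below wraps every transformer in `try/except`.

```python
import numpy as np
import pandas as pd

# Made-up, strongly right-skewed values.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0], name="toy_area")

candidates = {
    "original": values,
    "log_e": np.log(values),              # needs strictly positive input
    "reciprocal": 1.0 / values,           # needs non-zero input
    "power_0.5": np.power(values, 0.5),   # square root
}

# Skewness closer to 0 suggests a more symmetric distribution.
skewness = {name: series.skew() for name, series in candidates.items()}
for name, skew in skewness.items():
    print(f"{name}: {skew:.2f}")
```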


In [None]:
import pandas as pd
from feature_engine import transformation as vt
import warnings

def feat_engineering_numerical(df_feat_eng):
    """
    Applies various numerical transformations to all numerical columns in the DataFrame.
    
    Parameters:
        df_feat_eng (pd.DataFrame): The DataFrame to transform.
    
    Returns:
        pd.DataFrame: The DataFrame with original and transformed numerical columns.
    """
    # Create a deep copy of the DataFrame to avoid SettingWithCopyWarning
    df_feat_eng_copy = df_feat_eng.copy()

    # Detect numerical columns in the DataFrame
    numerical_columns = df_feat_eng_copy.select_dtypes(include='number').columns.tolist()

    # Define transformations and their corresponding column suffixes
    transformations = {
        "log_e": vt.LogTransformer(),
        "log_10": vt.LogTransformer(base='10'),
        "reciprocal": vt.ReciprocalTransformer(),
        "power": vt.PowerTransformer(),
        "box_cox": vt.BoxCoxTransformer(),
        "yeo_johnson": vt.YeoJohnsonTransformer()
    }

    # Iterate over each numerical column and apply each transformation
    for column in numerical_columns:
        for suffix, transformer in transformations.items():
            new_column_name = f"{column}_{suffix}"
            transformer.variables = [column]  # Set the variables attribute dynamically
            try:
                with warnings.catch_warnings(record=True) as w:
                    warnings.simplefilter("always")
                    # Apply transformation and assign to new column in the copy DataFrame
                    df_feat_eng_copy[new_column_name] = transformer.fit_transform(df_feat_eng_copy[[column]])
                    # Check if any warnings were raised during the transformation
                    if len(w) > 0:
                        for warning in w:
                            print(f"Warning applying {transformer.__class__.__name__} to {new_column_name}: {warning.message}")
            except Exception as e:
                # Print error message with details if transformation fails
                print(f"Error applying {transformer.__class__.__name__} to {new_column_name}: {e}")

    return df_feat_eng_copy


In [None]:
df_train_numerical_transformed = feat_engineering_numerical(df)
df_train_numerical_transformed.head()

We will use our Custom Function to plot all transformations
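
The statistics box drawn by the function below reports a non-outlier range based on the usual 1.5 × IQR rule. A minimal sketch of counting the values outside that range follows; `count_iqr_outliers` is a hypothetical helper and the data is made up.

```python
import pandas as pd

def count_iqr_outliers(series, factor=1.5):
    """Count values outside [Q1 - factor * IQR, Q3 + factor * IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    return int(((series < lower) | (series > upper)).sum())

toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(count_iqr_outliers(toy))  # 1 (only the value 100 lies outside the range)
```

A count like this can back up the visual judgement of outlier levels made from the box plots.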

In [None]:
def plot_dataframe(df, target):
    import os
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from scipy.stats import spearmanr, kendalltau, probplot

    # Configure plot settings
    save_plot = False  # Set to False if you do not wish to save the plot
    path = './plots'  # Directory to save the plots

    for col in df.columns:
        # Validate input types
        if not isinstance(df[col], pd.Series) or not isinstance(target, pd.Series):
            raise ValueError("Both feature and target must be pandas Series.")

        # Calculate correlation coefficients
        pearson_corr = df[col].corr(target, method='pearson')
        spearman_corr = spearmanr(df[col], target)[0]
        kendall_corr = kendalltau(df[col], target)[0]

        # Create the figure and axes
        fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(7, 10), gridspec_kw={"height_ratios": [1, 8, 8]})
        fig.suptitle(f"{col}")

        # Boxplot
        sns.boxplot(data=df, x=df[col].name, ax=axes[0])
        axes[0].set_title(f"{df[col].name} Boxplot")

        # Histogram with KDE, setting KDE curve color to red
        sns.histplot(data=df, x=df[col].name, kde=True, ax=axes[1], line_kws={'color': 'red', 'lw': 2})
        axes[1].set_title(f"{df[col].name} Distribution - Histogram")

        # Q-Q plot for normality
        probplot(df[col], dist="norm", plot=axes[2], fit=True)
        axes[2].set_title(f"{df[col].name} Q-Q Plot")

        # Setting the main title for the figure
        fig.suptitle(f"{df[col].name} Plot")

        # Calculating statistics
        mean = df[col].mean()
        median = df[col].median()
        mode = df[col].mode()[0] if not df[col].mode().empty else 'NA'
        IQR = df[col].quantile(0.75) - df[col].quantile(0.25)
        skewness = df[col].skew()
        kurtosis = df[col].kurt()
        outlier_range_min = df[col].quantile(0.25) - 1.5 * IQR
        outlier_range_max = df[col].quantile(0.75) + 1.5 * IQR

        # Annotations with different colors and transparency
        text_x = 0.95
        text_y = 0.95

        # Format numeric values with two decimals, anything else as 'N/A'
        def fmt(value):
            return f"{value:.2f}" if isinstance(value, (int, float)) else 'N/A'

        stats_texts = (
            f"Skewness: {fmt(skewness)}\n"
            f"Kurtosis: {fmt(kurtosis)}\n"
            f"Mean: {fmt(mean)}\n"
            f"Median: {fmt(median)}\n"
            f"Mode: {fmt(mode)}\n"
            f"IQR: {fmt(IQR)}\n"
            f"Non-outlier range: [{fmt(outlier_range_min)}, {fmt(outlier_range_max)}]\n"
            f"Pearson: {fmt(pearson_corr)}\n"
            f"Spearman: {fmt(spearman_corr)}\n"
            f"Kendall-Tau: {fmt(kendall_corr)}"
        )

        # Place the text box on the histogram plot
        axes[1].text(text_x, text_y, stats_texts, transform=axes[1].transAxes, verticalalignment='top',
                     horizontalalignment='right', fontsize=10, bbox=dict(boxstyle="round,pad=0.5",
                                                                         facecolor='white', edgecolor='gray',
                                                                         alpha=0.9))

        # Display the plot
        plt.tight_layout(rect=[0, 0.03, 1, 0.95])
        # Save the plot with the feature name as the filename
        if save_plot:
            plt.savefig(os.path.join(path, f"{df[col].name}.png"))

        plt.show()
        plt.close()

In [None]:
plot_dataframe(df_train_numerical_transformed, df['SalePrice'])

## Feature Transformations Exploration


|Feature|1st Option: Transformation|1st Option: Outliers|2nd Option: Transformation|2nd Option: Outliers|3rd Option: Transformation|3rd Option: Outliers|
|--------|---------------|---------|---------------|---------|---------------|---------|
|'1stFlrSF'|Yeo Johnson|Low|Box cox|Low|Log_e|Low|
|'2ndFlrSF'|Yeo Johnson|None|Power|None|Original Values|Low|
|'BedroomAbvGr'|Yeo Johnson|Low|Box cox|Low|Power|Low|
|'BsmtExposure'|Yeo Johnson|Low|Power|Low|Original Values|Low|
|'BsmtFinSF1'|Original Values|Low|Power|None|Yeo Johnson|None|
|'BsmtFinType1'|Original Values|None|Yeo Johnson|None|Power|None|
|'BsmtUnfSF'|Yeo Johnson|Low|Power|Low|Original Values|Medium|
|'EnclosedPorch'|Yeo Johnson|Low|Power|High|Original Values|Very High|
|'GarageArea'|Yeo Johnson|Medium|Original Values|High|Power|Medium|
|'GarageFinish'|Yeo Johnson|None|Original Values|None|Power|Low|
|'GarageYrBlt'|Power|Low|Reciprocal|Low|Log_e|Low|
|'GrLivArea'|Yeo Johnson|Low|Power|Medium|Log_e|Low|
|'KitchenQual'|Yeo Johnson|None|Power|None|Log_e|None|
|'LotArea'|Yeo Johnson|Very High|Log_e|Very High|Reciprocal|Very High|
|'LotFrontage'|Power|Very High|Yeo Johnson|Very High|Log_e|Very High|
|'MasVnrArea'|Yeo Johnson|None|Power|Low|Original Values|X High|
|'OpenPorchSF'|Yeo Johnson|None|Power|Low|Original Values|X High|
|'OverallCond'|Yeo Johnson|Low|Power|Low|Original Values|Low|
|'OverallQual'|Original Values|Low|Yeo Johnson|Low|Power|Low|
|'TotalBsmtSF'|Yeo Johnson|High|Original Values|High|Power|Medium|
|'WoodDeckSF'|Yeo Johnson|Low|Power|X High|Original Values|X High|
|'YearBuilt'|Original Values|Low|Power|Low|Log_e|Low|
|'YearRemodAdd'|Power|None|Log_e|None|Original Values|None|
|'SalePrice'|Yeo Johnson|High|Log_e|High|Original Values|X High|
|'NF_TotalBsmtSF_mul_BsmtExposure'|Yeo Johnson|X High|Power|X High|Original Values|X High|
|'NF_TotalBsmtSF_mul_BsmtFinType1'|Power|Low|Yeo Johnson|None|Original Values|X High|
|'NF_BsmtFinSF1_mul_BsmtFinType1'|Yeo Johnson|None|Power|None|Original Values|Low|
|'NF_GarageFinish_mul_GarageArea'|Yeo Johnson|Low|Power|Low|Original Values|High|
|'NF_Age_Garage'|Yeo Johnson|None|Power|None|Original Values|Low|
|'NF_Age_Build'|Power|None|Yeo Johnson|None|Original Values|Low|
|'NF_Age_Remod'|Yeo Johnson|None|Power|None|Original Values|None|
|'NF_Remod_TEST'|Yeo Johnson|None|Power|None|Original Values|Medium|
|'NF_Has_2nd_floor'|Original Values|None| | | | |
|'NF_Has_basement'|Original Values|Low| | | | |
|'NF_Has_garage'|Original Values|Low| | | | |
|'NF_Has_Masonry_Veneer'|Original Values|None| | | | |
|'NF_Has_Enclosed_Porch'|Original Values|Low| | | | |
|'NF_Has_Open_Porch'|Original Values|None| | | | |
|'NF_Has_ANY_Porch'|Original Values|None| | | | |
|'NF_Has_Wooden_Deck'|Original Values|Low| | | | |
|'NF_TotalLivingArea'|Yeo Johnson|Low|Log_e|Low|Power|Medium|
|'NF_TotalLivingArea_mul_OverallQual'|Yeo Johnson|Low|Power|Medium|Log_e|Low|
|'NF_TotalLivingArea_mul_OverallCond'|Yeo Johnson|High|Log_e|High|Power|X High|
|'NF_1stFlrSF_mul_OverallQual'|Yeo Johnson|Low|Power|High|Log_e|Medium|
|'NF_2ndFlrSF_mul_OverallQual'|Yeo Johnson|None|Power|None|Original Values|Medium|


Winsorizer can be skipped for now; it will be applied later, during model building and evaluation.

The new features are likely to be highly correlated with each other. This might lead to overfitting, so hyperparameters have to be selected carefully.

We will use the table above when building the model.

Below is a quick survey of how all features (including the new ones) correlate with each other

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Compute the correlation matrix with numeric_only set to True
corr = df.corr(numeric_only=True)

# Create a heatmap
plt.figure(figsize=(30, 30))  # Set the size of the figure
plt.matshow(corr, cmap='coolwarm', fignum=1)  # Plot the correlation matrix as a heatmap

# Add color bar
plt.colorbar()

# Add labels to the x and y axes
plt.xticks(range(len(corr.columns)), corr.columns, rotation=45, ha='left')
plt.yticks(range(len(corr.columns)), corr.columns)

# Title for the heatmap
plt.title('Correlation Heatmap with Coefficients', pad=20)

# Add correlation coefficients on the heatmap
for (i, j), val in np.ndenumerate(corr.values):
    plt.text(j, i, f'{val:.2f}', ha='center', va='center', color='black')

# Show the plot
plt.show()


As expected, there is very high correlation between features (because the new ones are derived from the same source features).
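
With this many columns the heatmap is hard to read, so it can help to list the strongly correlated pairs explicitly. The sketch below does this on made-up data; the 0.8 threshold is an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "GarageArea": [0.0, 300.0, 480.0, 600.0],
    "GarageFinish": [0.0, 1.0, 3.0, 2.0],
    "WoodDeckSF": [50.0, 0.0, 0.0, 50.0],
})
toy["NF_GarageFinish_mul_GarageArea"] = toy["GarageFinish"] * toy["GarageArea"]

corr = toy.corr().abs()
# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high_pairs = pairs[pairs > 0.8].sort_values(ascending=False)
print(high_pairs)
```

In this toy example the engineered product correlates strongly with both of its source columns, which mirrors the pattern seen in the heatmap.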

## Moving to Model Building