# Notebook 04 - Feature Engineering

## Objectives

Engineer Features for:
* Classification
* Regression
* Clustering

## Inputs
* outputs/datasets/cleaned/train.parquet.gzip

## Outputs
* Create clean dataset:
    * all new datasets produced by cleaning will be stored in inputs/datasets/cleaning
* Split the created dataset into 3 parts:
    * Train
    * Validate
    * Test
* All new datasets (train, validate and test) will be stored in outputs/datasets/cleaned

## Change working directory
In this section we get the location of the current directory and move one step up, to the parent folder, so that the app can access the project folder.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [None]:
import os

current_dir = os.getcwd()
current_dir

In [None]:
os.chdir(os.path.dirname(current_dir))
print('Current working directory is', os.getcwd())

## Loading Dataset

In [None]:
import pandas as pd

df = pd.read_parquet('outputs/datasets/cleaned/train.parquet.gzip')
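# The leftover CSV index column may be absent in the parquet file, so do not fail if it is missing
df = df.drop(columns=['Unnamed: 0'], errors='ignore')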
df.head()

## Data Exploration

Hypothesis 2 also failed. It is possible that features interact with each other in ways that new combined features can capture, and we can also extract useful information from existing features.
1. Change the encoding (create a dictionary for the ordinal encoder):
    * When we encode Basement Exposure and Finish Type, None becomes 0, which is fine because there is no basement.
    * When we encode Garage Finish, None likewise becomes 0, because there is no garage.
    * Kitchen Quality - Po (Poor) becomes 0, which is wrong, since 0 is meant to stand for a missing feature. The encoded value also matters once it interacts with other features.
2. Create new mathematical sub_features:
    * Basement:
        * Basement Exposure mathematical manipulations with all Basement Areas
        * Basement Finish Type manipulations with all Basement Areas
    * Garage:
        * Garage Finish mathematical manipulations with Garage Area
    * Building:
        * Overall Cond mathematical manipulations with building areas
        * Overall Quality mathematical manipulations with building areas
3. Extract information and create new sub_features (we know building dates run up to 2010):
    * Garage Age = 2010 - Garage Year Built
    * Building Age = 2010 - Year Built
    * Remod Age = 2010 - Remodel Year
    * Remod Age Test = if the house was built and remodeled in the same year, this value will be 0, else Remod Age
4. Check whether a house feature exists (maybe garage, porch or deck size does not matter; what matters is that it is there):
    * Has 2nd floor - If area of 2nd floor > 0, we will set to True, else False
    * Has Basement - If building has basement = True, else False
    * Has Garage - If building has Garage = True, else False
    * Has Masonry Veneer - If building has masonry veneer = True, else False
    * Has Enclosed Porch - If building has Enclosed Porch = True, else False
    * Has Open Porch - If building has Open Porch = True, else False
    * Has Any Porch - If building has any type of porch = True, else False
    * Has Wooden Deck - If building Has wooden deck = True, else False

After the new features are created, check the correlation between the existing features and the new ones.
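
The correlation check described above could look roughly like this. It is a minimal sketch on a made-up toy frame: the values and the helper name `rank_by_target_correlation` are assumptions for illustration and are not part of this notebook.

```python
import pandas as pd


def rank_by_target_correlation(frame, target):
    """Absolute Pearson correlation of every other column with `target`, strongest first."""
    corr = frame.corr()[target].drop(target)
    return corr.abs().sort_values(ascending=False)


# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "GarageArea": [0, 200, 400, 600, 800],
    "xxx_Has_garage": [0, 1, 1, 1, 1],
    "SalePrice": [100, 150, 200, 250, 300],
})

ranking = rank_by_target_correlation(toy, "SalePrice")
print(ranking)
```

Sorting by absolute value keeps strongly negative relationships (for example, age features) near the top as well.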

## Feature Engineering

### Categorical Features Encoding

1. We will give the encoder an explicit category order, so that encoded categorical features receive correct, or at least logical, numbers
2. We will add a second encoding with OneHotEncoder, so we can compare how each one increases or decreases model performance

In [None]:
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

# Category order for each ordinal feature, from lowest to highest.
# Kitchen Quality also starts with 'None'; otherwise 'Po' would be encoded as 0.
order = {
    'BsmtExposure': ['None', 'No', 'Mn', 'Av', 'Gd'],
    'BsmtFinType1': ['None', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'],
    'GarageFinish': ['None', 'Unf', 'RFn', 'Fin'],
    'KitchenQual': ['None', 'Po', 'Fa', 'TA', 'Gd', 'Ex']
}

# Encode exactly the features listed in `order`, so every column is paired
# with its own category list regardless of the column order in df
categorical_features = list(order.keys())

# Initialize the OrdinalEncoder with the specified order
encoder = OrdinalEncoder(categories=[order[feature] for feature in categorical_features])

# Fit and transform the data
df[categorical_features] = encoder.fit_transform(df[categorical_features])
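
To make the difference between the two encodings concrete, here is a small sketch on toy data. It uses `pd.Categorical` and `pd.get_dummies` as stand-ins for the scikit-learn encoders, which is an assumption made to keep the example dependency-free; the toy values are made up.

```python
import pandas as pd

# Toy data (hypothetical values); the string 'None' is a real category meaning "no garage"
toy = pd.DataFrame({"GarageFinish": ["None", "Unf", "RFn", "Fin", "Unf"]})

# Ordinal encoding with an explicit order: 'None' -> 0, 'Unf' -> 1, ...
order = ["None", "Unf", "RFn", "Fin"]
toy["GarageFinish_ordinal"] = pd.Categorical(
    toy["GarageFinish"], categories=order, ordered=True
).codes

# One-hot encoding: one 0/1 column per category, with no implied order
one_hot = pd.get_dummies(toy["GarageFinish"], prefix="GarageFinish", dtype=int)

print(toy.join(one_hot))
```

The ordinal column can be multiplied with an area, which the sub_features below rely on; the one-hot columns cannot be used that way, but they do not impose a ranking on the categories.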

### Basement Features

First we will create new sub_features using RelativeFeatures.

In [None]:
from feature_engine.creation import RelativeFeatures

basement_features = ['BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtUnfSF', 'TotalBsmtSF']
transformer = RelativeFeatures(
    variables=['BsmtFinSF1', 'BsmtUnfSF', 'TotalBsmtSF'],
    reference=['BsmtExposure', 'BsmtFinType1'],
    func=["sub", "mul", "add"],  # We will try to subtract, multiply and add - sum features
)
df_basement = transformer.fit_transform(df[basement_features])
df_basement.head()
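
For readers unfamiliar with RelativeFeatures, the sketch below reproduces the same kind of columns in plain pandas for a single variable and reference. The toy values are made up, and the `<variable>_<func>_<reference>` naming is taken from the column names that appear later in this notebook.

```python
import pandas as pd

# Toy data (hypothetical values); BsmtExposure is already ordinal-encoded
toy = pd.DataFrame({
    "TotalBsmtSF": [0, 800, 1200],
    "BsmtExposure": [0, 1, 4],
})

# The three operations requested in func=["sub", "mul", "add"]
operations = {
    "sub": lambda variable, reference: variable - reference,
    "mul": lambda variable, reference: variable * reference,
    "add": lambda variable, reference: variable + reference,
}

for name, operation in operations.items():
    toy[f"TotalBsmtSF_{name}_BsmtExposure"] = operation(toy["TotalBsmtSF"], toy["BsmtExposure"])

print(toy)
```

Note that a house without a basement gets 0 in the `mul` column, which is the reason the encoding maps 'None' to 0.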

Now, using SmartCorrelatedSelection, we will identify sets of correlated sub_features, so we do not need to work with all of them

In [None]:
from feature_engine.selection import SmartCorrelatedSelection

tr = SmartCorrelatedSelection(
    variables=None,
    method="pearson",
    threshold=0.8,
    missing_values="raise",
    selection_method="variance",
    estimator=None,
)

tr.fit_transform(df_basement)

basement_feature_sets = tr.correlated_feature_sets_
basement_feature_sets

Very nice, we can see the sets; based on them we will select just what we need

In [None]:
selected_features = []

for feature_set in tr.correlated_feature_sets_:
    # Calculate variances within each set
    variances = {feature: df_basement[feature].var() for feature in feature_set}
    # Select the feature with the highest variance
    best_feature = max(variances, key=variances.get)
    selected_features.append(best_feature)

print("Selected features:", selected_features)
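
The idea behind this selection can be sketched without the library. The function below is a simplified greedy grouping written for illustration (the name `pick_highest_variance` and the toy data are assumptions); it is not the algorithm SmartCorrelatedSelection implements, and unlike `correlated_feature_sets_` it also returns features that correlate with nothing else.

```python
import numpy as np
import pandas as pd


def pick_highest_variance(frame, threshold=0.8):
    """Greedily group columns whose |Pearson r| exceeds `threshold`
    and keep the highest-variance column of each group."""
    corr = frame.corr().abs()
    remaining = list(frame.columns)
    kept = []
    while remaining:
        first = remaining[0]
        # A column always correlates perfectly with itself, so `first` is in its own group
        group = [col for col in remaining if corr.loc[first, col] > threshold]
        kept.append(frame[group].var().idxmax())
        remaining = [col for col in remaining if col not in group]
    return kept


rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_times_10": a * 10,        # perfectly correlated with "a", larger variance
    "b": rng.normal(size=200),   # independent noise
})

selected = pick_highest_variance(toy)
print(selected)
```

Selecting by variance favours features on a larger scale, so a product such as area times quality will usually win over the raw encoded category.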


We can see that the best features and their combinations are:
1. TotalBsmtSF * BsmtExposure => looks good and logical
2. TotalBsmtSF * BsmtFinType1 => also logical
3. BsmtFinSF1 * BsmtFinType1 => very logical
4. BsmtUnfSF - BsmtFinType1 => doubtful: it is the unfinished area minus the finish type
5. TotalBsmtSF + BsmtFinType1 => also not very logical

We will keep the first three and create them like this (every new sub_feature gets the prefix xxx, which helps to identify them):
```python
df['xxx_TotalBsmtSF_mul_BsmtExposure'] = df['TotalBsmtSF'] * df['BsmtExposure']
df['xxx_TotalBsmtSF_mul_BsmtFinType1'] = df['TotalBsmtSF'] * df['BsmtFinType1']
df['xxx_BsmtFinSF1_mul_BsmtFinType1'] = df['BsmtFinType1'] * df['BsmtFinSF1']
```

In [None]:
df['xxx_TotalBsmtSF_mul_BsmtExposure'] = df['TotalBsmtSF'] * df['BsmtExposure']
df['xxx_TotalBsmtSF_mul_BsmtFinType1'] = df['TotalBsmtSF'] * df['BsmtFinType1']
df['xxx_BsmtFinSF1_mul_BsmtFinType1'] = df['BsmtFinType1'] * df['BsmtFinSF1']

### Garage Features

In [None]:
df['xxx_GarageFinish_mul_GarageArea'] = df['GarageFinish'] * df['GarageArea']

We will add this code to the cell that creates the new sub_features:
```python
df['xxx_GarageFinish_mul_GarageArea'] = df['GarageFinish'] * df['GarageArea']
```

### Building sub_features:

Now this is the hardest part. We have 2 categories for the building:
* Overall Quality - rates the overall material and finish of the house
* Overall Condition - rates the overall condition of the house

Logically, they should apply to the whole building, so we could relate these values (after ordinal encoding with the dictionary) directly to Sale Price. But we cannot, as they do not apply to:
* Lot Area
* Lot Frontage
* Porches, etc

Based on observation of the whole dataset, they *should* apply only to living areas. We can do this in 2 ways:
* Sum all living areas of the building and combine the total with those 2 categories
* Combine each category with each area of the building individually: ground level, 1st and 2nd floors

We will do both and use smart correlation to select only the best ones. We do not want to add too many new sub_features, as they can add noise to the ML model.

In [None]:
df['xxx_TotalLivingArea'] = df['GrLivArea'] + df['1stFlrSF'] + df['2ndFlrSF']


In [None]:
from feature_engine.creation import RelativeFeatures


living_features = ['GrLivArea', '1stFlrSF', '2ndFlrSF', 'xxx_TotalLivingArea', 'OverallCond', 'OverallQual']
transformer = RelativeFeatures(
    variables=['GrLivArea', '1stFlrSF', '2ndFlrSF', 'xxx_TotalLivingArea'],
    reference=['OverallCond', 'OverallQual'],
    func=["mul"]
)
df_living_area = transformer.fit_transform(df[living_features])
df_living_area.head()

Let's check the correlation between all of them and select the best ones

In [None]:
from feature_engine.selection import SmartCorrelatedSelection

tr = SmartCorrelatedSelection(
    variables=None,
    method="pearson",
    threshold=0.9,
    missing_values="raise",
    selection_method="variance",
    estimator=None,
)

tr.fit_transform(df_living_area)

living_area_sets = tr.correlated_feature_sets_
living_area_sets

In [None]:
selected_features = []

for feature_set in tr.correlated_feature_sets_:
    # Calculate variances within each set
    variances = {feature: df_living_area[feature].var() for feature in feature_set}
    # Select the feature with the highest variance
    best_feature = max(variances, key=variances.get)
    selected_features.append(best_feature)

print("Selected features:", selected_features)


We can see we are getting these features:
* 'xxx_TotalLivingArea_mul_OverallQual' - logical and agreed, we will keep this sub_feature
* '1stFlrSF_mul_OverallQual' - also logical, we will keep it
* '2ndFlrSF_mul_OverallQual' - also logical
* 'xxx_TotalLivingArea_mul_OverallCond' - also logical

We will keep all these new sub_features and add the code for creating them:
```python
df['xxx_TotalLivingArea'] = df['GrLivArea'] + df['1stFlrSF'] + df['2ndFlrSF']
df['xxx_TotalLivingArea_mul_OverallQual'] = df['xxx_TotalLivingArea'] * df['OverallQual']
df['xxx_TotalLivingArea_mul_OverallCond'] = df['xxx_TotalLivingArea'] * df['OverallCond']
df['xxx_1stFlrSF_mul_OverallQual'] = df['1stFlrSF'] * df['OverallQual']
df['xxx_2ndFlrSF_mul_OverallQual'] = df['2ndFlrSF'] * df['OverallQual']
```

### Extraction of information from Features and creating new ones

In [None]:
df['xxx_Age_Garage'] = 2010 - df['GarageYrBlt']
df['xxx_Age_Build'] = 2010 - df['YearBuilt']
df['xxx_Age_Remod'] = 2010 - df['YearRemodAdd']
df['xxx_Remod_TEST'] = df.apply(lambda row: 0 if row['xxx_Age_Build'] == row['xxx_Age_Remod'] else row['xxx_Age_Remod'], axis=1)

Adding this code to the cell that creates the new sub_features:
```python
df['xxx_Age_Garage'] = 2010 - df['GarageYrBlt']
df['xxx_Age_Build'] = 2010 - df['YearBuilt']
df['xxx_Age_Remod'] = 2010 - df['YearRemodAdd']
df['xxx_Remod_TEST'] = df.apply(lambda row: 0 if row['xxx_Age_Build'] == row['xxx_Age_Remod'] else row['xxx_Age_Remod'], axis=1)
```
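
The row-wise `apply` above works, but the same result can be obtained with a vectorised `Series.where`, which is usually faster on larger frames. This is only an alternative sketch on made-up toy values; `REFERENCE_YEAR` is a name introduced here for illustration.

```python
import pandas as pd

REFERENCE_YEAR = 2010  # latest year in the dataset, as stated in the notes above

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "YearBuilt": [1990, 2000, 2005],
    "YearRemodAdd": [1990, 2008, 2005],
})

age_build = REFERENCE_YEAR - toy["YearBuilt"]
age_remod = REFERENCE_YEAR - toy["YearRemodAdd"]

# Keep the remodel age only where it differs from the building age; otherwise use 0
toy["xxx_Remod_TEST"] = age_remod.where(age_remod != age_build, 0)

print(toy)
```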

### Checking Features if they exist and creating new ones

After the features are created, we will save them as INT, which is easier for machine learning

In [None]:
# 1 if the corresponding area is non-zero, else 0 (vectorised comparison instead of row-wise apply)
df['xxx_Has_2nd_floor'] = (df['2ndFlrSF'] != 0).astype(int)
df['xxx_Has_basement'] = (df['TotalBsmtSF'] != 0).astype(int)
df['xxx_Has_garage'] = (df['GarageArea'] != 0).astype(int)
df['xxx_Has_Masonry_Veneer'] = (df['MasVnrArea'] != 0).astype(int)
df['xxx_Has_Enclosed_Porch'] = (df['EnclosedPorch'] != 0).astype(int)
df['xxx_Has_Open_Porch'] = (df['OpenPorchSF'] != 0).astype(int)
df['xxx_Has_ANY_Porch'] = df['xxx_Has_Enclosed_Porch'] | df['xxx_Has_Open_Porch']
df['xxx_Has_Wooden_Deck'] = (df['WoodDeckSF'] != 0).astype(int)

### Code to create new sub_features

In [None]:
df['xxx_TotalBsmtSF_mul_BsmtExposure'] = df['TotalBsmtSF'] * df['BsmtExposure']
df['xxx_TotalBsmtSF_mul_BsmtFinType1'] = df['TotalBsmtSF'] * df['BsmtFinType1']
df['xxx_BsmtFinSF1_mul_BsmtFinType1'] = df['BsmtFinType1'] * df['BsmtFinSF1']
df['xxx_GarageFinish_mul_GarageArea'] = df['GarageFinish'] * df['GarageArea']
df['xxx_TotalLivingArea'] = df['GrLivArea'] + df['1stFlrSF'] + df['2ndFlrSF']
df['xxx_TotalLivingArea_mul_OverallQual'] = df['xxx_TotalLivingArea'] * df['OverallQual']
df['xxx_TotalLivingArea_mul_OverallCond'] = df['xxx_TotalLivingArea'] * df['OverallCond']
df['xxx_1stFlrSF_mul_OverallQual'] = df['1stFlrSF'] * df['OverallQual']
df['xxx_2ndFlrSF_mul_OverallQual'] = df['2ndFlrSF'] * df['OverallQual']
df['xxx_Age_Garage'] = 2010 - df['GarageYrBlt']
df['xxx_Age_Build'] = 2010 - df['YearBuilt']
df['xxx_Age_Remod'] = 2010 - df['YearRemodAdd']
df['xxx_Remod_TEST'] = df.apply(lambda row: 0 if row['xxx_Age_Build'] == row['xxx_Age_Remod'] else row['xxx_Age_Remod'], axis=1)
# 1 if the corresponding area is non-zero, else 0 (vectorised comparison instead of row-wise apply)
df['xxx_Has_2nd_floor'] = (df['2ndFlrSF'] != 0).astype(int)
df['xxx_Has_basement'] = (df['TotalBsmtSF'] != 0).astype(int)
df['xxx_Has_garage'] = (df['GarageArea'] != 0).astype(int)
df['xxx_Has_Masonry_Veneer'] = (df['MasVnrArea'] != 0).astype(int)
df['xxx_Has_Enclosed_Porch'] = (df['EnclosedPorch'] != 0).astype(int)
df['xxx_Has_Open_Porch'] = (df['OpenPorchSF'] != 0).astype(int)
df['xxx_Has_ANY_Porch'] = df['xxx_Has_Enclosed_Porch'] | df['xxx_Has_Open_Porch']
df['xxx_Has_Wooden_Deck'] = (df['WoodDeckSF'] != 0).astype(int)
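
Because the same steps have to be repeated for the validation and test sets, it may help to wrap them in a function. The sketch below shows the pattern for three of the sub_features only; the function name `add_sub_features` and the toy values are assumptions, not part of this notebook.

```python
import pandas as pd


def add_sub_features(frame, reference_year=2010):
    """Return a copy of `frame` with a few of the xxx_ sub_features added."""
    out = frame.copy()
    out["xxx_GarageFinish_mul_GarageArea"] = out["GarageFinish"] * out["GarageArea"]
    out["xxx_Age_Build"] = reference_year - out["YearBuilt"]
    out["xxx_Has_garage"] = (out["GarageArea"] != 0).astype(int)
    return out


# Toy data (hypothetical values); GarageFinish is already ordinal-encoded
toy = pd.DataFrame({
    "GarageFinish": [0, 2],
    "GarageArea": [0, 400],
    "YearBuilt": [1950, 2009],
})

result = add_sub_features(toy)
print(result)
```

Working on a copy leaves the input frame untouched, so the function can be called on train, validate and test in turn.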

### Feature Engineering

Checking for any transformations needed to all features and new sub_features.


In [None]:
from feature_engine import transformation as vt
from feature_engine.outliers import Winsorizer
from feature_engine.encoding import OrdinalEncoder
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

sns.set(style="whitegrid")
import warnings

warnings.filterwarnings('ignore')


def FeatureEngineeringAnalysis(df, analysis_type=None, plot=False):
    """
    - used for quick feature engineering on numerical and categorical variables
    to decide which transformation can better transform the distribution shape 
    - Once transformed, use a reporting tool, like pandas-profiling, to evaluate distributions
  
    """
    check_missing_values(df)
    allowed_types = ['numerical', 'ordinal_encoder', 'outlier_winsorizer']
    check_user_entry_on_analysis_type(analysis_type, allowed_types)
    list_column_transformers = define_list_column_transformers(analysis_type)

    # Loop in each variable and engineer the data according to the analysis type
    df_feat_eng = pd.DataFrame([])
    for column in df.columns:
        # create additional columns (column_method) to apply the methods
        df_feat_eng = pd.concat([df_feat_eng, df[column]], axis=1)
        for method in list_column_transformers:
            df_feat_eng[f"{column}_{method}"] = df[column]

        # Apply transformers in respectives column_transformers
        df_feat_eng, list_applied_transformers = apply_transformers(analysis_type, df_feat_eng, column)

        if plot:
            # For each variable, assess how the transformations perform
            transformer_evaluation(column, list_applied_transformers, analysis_type, df_feat_eng)

    return df_feat_eng


def check_user_entry_on_analysis_type(analysis_type, allowed_types):
    # Check analysis type
    if analysis_type is None:
        raise SystemExit(f"You should pass analysis_type parameter as one of the following options: {allowed_types}")
    if analysis_type not in allowed_types:
        raise SystemExit(f"analysis_type argument should be one of these options: {allowed_types}")


def check_missing_values(df):
    if df.isna().sum().sum() != 0:
        raise SystemExit(
            "There are missing values in your dataset. Please handle them before getting into feature engineering.")


def define_list_column_transformers(analysis_type):
    # Initialize an empty list for storing transformer names
    list_column_transformers = []

    # Numerical analysis transformers
    if analysis_type == 'numerical':
        list_column_transformers = ["log_e", "log_10", "reciprocal", "power", "box_cox", "yeo_johnson"]

    # Transformer for ordinal encoding
    elif analysis_type == 'ordinal_encoder':
        list_column_transformers = ["ordinal_encoder"]

    # Transformer for outlier handling via winsorization
    elif analysis_type == 'outlier_winsorizer':
        list_column_transformers = ['iqr']

    return list_column_transformers


def apply_transformers(analysis_type, df_feat_eng, column):
    for col in df_feat_eng.select_dtypes(include='category').columns:
        df_feat_eng[col] = df_feat_eng[col].astype('object')

    if analysis_type == 'numerical':
        df_feat_eng, list_applied_transformers = FeatEngineering_Numerical(df_feat_eng, column)

    elif analysis_type == 'outlier_winsorizer':
        df_feat_eng, list_applied_transformers = FeatEngineering_OutlierWinsorizer(df_feat_eng, column)

    elif analysis_type == 'ordinal_encoder':
        df_feat_eng, list_applied_transformers = FeatEngineering_CategoricalEncoder(df_feat_eng, column)

    elif analysis_type == 'onehot_encoder':
        df_feat_eng, list_applied_transformers = FeatEngineering_OneHotEncoder(df_feat_eng, column)

    return df_feat_eng, list_applied_transformers


def transformer_evaluation(column, list_applied_transformers, analysis_type, df_feat_eng):
    # For each variable, assess how the transformations perform
    print(f"* Variable Analyzed: {column}")
    print(f"* Applied transformation: {list_applied_transformers} \n")
    for col in [column] + list_applied_transformers:

        if analysis_type != 'ordinal_encoder':
            DiagnosticPlots_Numerical(df_feat_eng, col)

        else:
            if col == column:
                DiagnosticPlots_Categories(df_feat_eng, col)
            else:
                DiagnosticPlots_Numerical(df_feat_eng, col)

        print("\n")


def DiagnosticPlots_Categories(df_feat_eng, col):
    plt.figure(figsize=(4, 3))
    sns.countplot(data=df_feat_eng, x=col, palette=['#432371'], order=df_feat_eng[col].value_counts().index)
    plt.xticks(rotation=90)
    plt.suptitle(f"{col}", fontsize=30, y=1.05)
    plt.show()
    print("\n")


def DiagnosticPlots_Numerical(df, variable):
    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    sns.histplot(data=df, x=variable, kde=True, element="step", ax=axes[0])
    stats.probplot(df[variable], dist="norm", plot=axes[1])
    sns.boxplot(x=df[variable], ax=axes[2])

    axes[0].set_title('Histogram')
    axes[1].set_title('QQ Plot')
    axes[2].set_title('Boxplot')
    fig.suptitle(f"{variable}", fontsize=30, y=1.05)
    plt.tight_layout()
    plt.show()


def FeatEngineering_CategoricalEncoder(df_feat_eng, column):
    list_methods_worked = []
    try:
        encoder = OrdinalEncoder(encoding_method='arbitrary', variables=[f"{column}_ordinal_encoder"])
        df_feat_eng = encoder.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_ordinal_encoder")

    except Exception:
        df_feat_eng.drop([f"{column}_ordinal_encoder"], axis=1, inplace=True)

    return df_feat_eng, list_methods_worked


from sklearn.preprocessing import OneHotEncoder


def FeatEngineering_OneHotEncoder(df_feat_eng, column):
    list_methods_worked = []
    try:
        # `sparse_output` replaced `sparse` in scikit-learn 1.2; on older versions use sparse=False
        encoder = OneHotEncoder(sparse_output=False, drop='first')
        encoded_cols = encoder.fit_transform(df_feat_eng[[column]])
        # Reuse the source index so that concat aligns the rows correctly
        encoded_df = pd.DataFrame(encoded_cols, columns=encoder.get_feature_names_out([column]),
                                  index=df_feat_eng.index)
        df_feat_eng = pd.concat([df_feat_eng, encoded_df], axis=1)
        # Report the columns that were actually created
        list_methods_worked.extend(encoded_df.columns)

    except Exception as e:
        print(f"Error occurred: {e}")
        df_feat_eng.drop([f"{column}_onehot_encoder"], axis=1, inplace=True, errors='ignore')

    return df_feat_eng, list_methods_worked


def FeatEngineering_OutlierWinsorizer(df_feat_eng, column):
    list_methods_worked = []

    ### Winsorizer iqr
    try:
        disc = Winsorizer(
            capping_method='iqr', tail='both', fold=1.5, variables=[f"{column}_iqr"])
        df_feat_eng = disc.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_iqr")
    except Exception:
        df_feat_eng.drop([f"{column}_iqr"], axis=1, inplace=True)

    return df_feat_eng, list_methods_worked


import numpy as np
import pandas as pd


def FeatEngineering_Numerical(df_feat_eng, column):
    list_methods_worked = []

    # LogTransformer base e
    try:
        lt = vt.LogTransformer(variables=[f"{column}_log_e"])
        df_feat_eng = lt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_log_e")
    except Exception:
        df_feat_eng.drop([f"{column}_log_e"], axis=1, inplace=True)

    # LogTransformer base 10
    try:
        lt = vt.LogTransformer(variables=[f"{column}_log_10"], base='10')
        df_feat_eng = lt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_log_10")
    except Exception:
        df_feat_eng.drop([f"{column}_log_10"], axis=1, inplace=True)

    # ReciprocalTransformer
    try:
        rt = vt.ReciprocalTransformer(variables=[f"{column}_reciprocal"])
        df_feat_eng = rt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_reciprocal")
    except Exception:
        df_feat_eng.drop([f"{column}_reciprocal"], axis=1, inplace=True)

    # PowerTransformer
    try:
        pt = vt.PowerTransformer(variables=[f"{column}_power"])
        df_feat_eng = pt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_power")
    except Exception:
        df_feat_eng.drop([f"{column}_power"], axis=1, inplace=True)

    # BoxCoxTransformer
    try:
        bct = vt.BoxCoxTransformer(variables=[f"{column}_box_cox"])
        df_feat_eng = bct.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_box_cox")
    except Exception:
        df_feat_eng.drop([f"{column}_box_cox"], axis=1, inplace=True)

    # YeoJohnsonTransformer
    try:
        yjt = vt.YeoJohnsonTransformer(variables=[f"{column}_yeo_johnson"])
        df_feat_eng = yjt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_yeo_johnson")
    except Exception:
        df_feat_eng.drop([f"{column}_yeo_johnson"], axis=1, inplace=True)

    return df_feat_eng, list_methods_worked

In [None]:
df = FeatureEngineeringAnalysis(df, analysis_type='numerical', plot=False)
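
One simple way to compare the transformed columns numerically is to look at their skewness, where values closer to 0 indicate a more symmetric distribution. The sketch below does this with NumPy on a made-up, right-skewed feature rather than with the feature_engine transformers; the square root stands in for a power transformation with exponent 0.5, which is an assumption about the exponent used.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Right-skewed, strictly positive toy feature (hypothetical data)
values = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000), name="toy_area")

candidates = {
    "original": values,
    "log_e": np.log(values),
    "power_0.5": np.sqrt(values),
    "reciprocal": 1.0 / values,
}

# Skewness of each candidate; smaller absolute value means a more symmetric shape
skewness = pd.Series({name: series.skew() for name, series in candidates.items()})
print(skewness.abs().sort_values())
```

Skewness alone does not say anything about outliers, so it complements the box plots and Q-Q plots rather than replacing them.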

We will use our Custom Function to plot all transformations

In [None]:
def plot_dataframe(df, target):
    import os
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from scipy.stats import spearmanr, kendalltau, probplot

    # Configure plot settings
    save_plot = True  # Set to False if you do not wish to save the plot
    path = './plots'  # Directory to save the plots
    if save_plot:
        os.makedirs(path, exist_ok=True)  # savefig fails if the directory does not exist

    for col in df.columns:
        # Validate input types
        if not isinstance(df[col], pd.Series) or not isinstance(target, pd.Series):
            raise ValueError("Both feature and target must be pandas Series.")

        # Calculate correlation coefficients
        pearson_corr = df[col].corr(target, method='pearson')
        spearman_corr = spearmanr(df[col], target)[0]
        kendall_corr = kendalltau(df[col], target)[0]

        # Create the figure and axes
        fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(7, 10), gridspec_kw={"height_ratios": [1, 8, 8]})
        fig.suptitle(f"{col}")

        # Boxplot
        sns.boxplot(data=df, x=df[col].name, ax=axes[0])
        axes[0].set_title(f"{df[col].name} Boxplot")

        # Histogram with KDE, setting KDE curve color to red
        sns.histplot(data=df, x=df[col].name, kde=True, ax=axes[1], line_kws={'color': 'red', 'lw': 2})
        axes[1].set_title(f"{df[col].name} Distribution - Histogram")

        # Q-Q plot for normality
        probplot(df[col], dist="norm", plot=axes[2], fit=True)
        axes[2].set_title(f"{df[col].name} Q-Q Plot")

        # Setting the main title for the figure
        fig.suptitle(f"{df[col].name} Plot")

        # Calculating statistics
        mean = df[col].mean()
        median = df[col].median()
        mode = df[col].mode()[0] if not df[col].mode().empty else 'NA'
        IQR = df[col].quantile(0.75) - df[col].quantile(0.25)
        skewness = df[col].skew()
        kurtosis = df[col].kurt()
        outlier_range_min = df[col].quantile(0.25) - 1.5 * IQR
        outlier_range_max = df[col].quantile(0.75) + 1.5 * IQR

        # Annotations with different colors and transparency
        text_x = 0.95
        text_y = 0.95

        stats_texts = (
            f"Skewness: {'{:.2f}'.format(skewness) if isinstance(skewness, (int, float)) else 'N/A'}\n "
            f"Kurtosis: {'{:.2f}'.format(kurtosis) if isinstance(kurtosis, (int, float)) else 'N/A'}\n"
            f"Mean: {'{:.2f}'.format(mean) if isinstance(mean, (int, float)) else 'N/A'}\n "
            f"Median: {'{:.2f}'.format(median) if isinstance(median, (int, float)) else 'N/A'}\n "
            f"Mode: {'{:.2f}'.format(mode) if isinstance(mode, (int, float)) else 'N/A'}\n"
            f"IQR: {'{:.2f}'.format(IQR) if isinstance(IQR, (int, float)) else 'N/A'}\n "
            f"Non-outlier range: [{'{:.2f}'.format(outlier_range_min) if isinstance(outlier_range_min, (int, float)) else 'N/A'}, {'{:.2f}'.format(outlier_range_max) if isinstance(outlier_range_max, (int, float)) else 'N/A'}]\n"
            f"Pearson: {'{:.2f}'.format(pearson_corr) if isinstance(pearson_corr, (int, float)) else 'N/A'}\n "
            f"Spearman: {'{:.2f}'.format(spearman_corr) if isinstance(spearman_corr, (int, float)) else 'N/A'}\n "
            f"Kendall-Tau: {'{:.2f}'.format(kendall_corr) if isinstance(kendall_corr, (int, float)) else 'N/A'}"
        )

        # Place the text box on the histogram plot
        axes[1].text(text_x, text_y, stats_texts, transform=axes[1].transAxes, verticalalignment='top',
                     horizontalalignment='right', fontsize=10, bbox=dict(boxstyle="round,pad=0.5",
                                                                         facecolor='white', edgecolor='gray',
                                                                         alpha=0.9))

        # Display the plot
        plt.tight_layout(rect=[0, 0.03, 1, 0.95])
        # Save the plot with the feature name as the filename
        if save_plot:
            plt.savefig(os.path.join(path, f"{df[col].name}.png"))

        #plt.show()
        plt.close()

In [None]:
plot_dataframe(df, df['SalePrice'])

## Feature Transformations Exploration


|Features|1st Option: Transformation|1st Option: Outliers|2nd Option: Transformation|2nd Option: Outliers|3rd Option: Transformation|3rd Option: Outliers|
|--------|---------------|---------|---------------|---------|---------------|---------|
|'1stFlrSF'|Yeo Johnson|Low|Box cox|Low|Log_e|Low|
|'2ndFlrSF'|Yeo Johnson|None|Power|None|Original Values|Low|
|'BedroomAbvGr'|Yeo Johnson|Low|Box cox|Low|Power|Low|
|'BsmtExposure'|Yeo Johnson|Low|Power|Low|Original Values|Low|
|'BsmtFinSF1'|Original Values|Low|Power|None|Yeo Johnson|None|
|'BsmtFinType1'|Original Values|None|Yeo Johnson|None|Power|None|
|'BsmtUnfSF'|Yeo Johnson|Low|Power|Low|Original Values|Medium|
|'EnclosedPorch'|Yeo Johnson|Low|Power|High|Original Values|Very High|
|'GarageArea'|Yeo Johnson|Medium|Original Values|High|Power|Medium|
|'GarageFinish'|Yeo Johnson|None|Original Values|None|Power|Low|
|'GarageYrBlt'|Power|Low|Reciprocal|Low|Log_e|Low|
|'GrLivArea'|Yeo Johnson|Low|Power|Medium|Log_e|Low|
|'KitchenQual'|Yeo Johnson|None|Power|None|Log_e|None|
|'LotArea'|Yeo Johnson|Very High|Log_e|Very High|Reciprocal|Very High|
|'LotFrontage'|Power|Very High|Yeo Johnson|Very High|Log_e|Very High|
|'MasVnrArea'|Yeo Johnson|None|Power|Low|Original Values|X High|
|'OpenPorchSF'|Yeo Johnson|None|Power|Low|Original Values|X High|
|'OverallCond'|Yeo Johnson|Low|Power|Low|Original Values|Low|
|'OverallQual'|Original Values|Low|Yeo Johnson|Low|Power|Low|
|'TotalBsmtSF'|Yeo Johnson|High|Original Values|High|Power|Medium|
|'WoodDeckSF'|Yeo Johnson|Low|Power|X High|Original Values|X High|
|'YearBuilt'|Original Values|Low|Power|Low|Log_e|Low|
|'YearRemodAdd'|Power|None|Log_e|None|Original Values|None|
|'SalePrice'|Yeo Johnson|High|Log_e|High|Original Values|X High|
|'xxx_TotalBsmtSF_mul_BsmtExposure'|Yeo Johnson|X High|Power|X High|Original Values|X High|
|'xxx_TotalBsmtSF_mul_BsmtFinType1'|Power|Low|Yeo Johnson|None|Original Values|X High|
|'xxx_BsmtFinSF1_mul_BsmtFinType1'|Yeo Johnson|None|Power|None|Original Values|Low|
|'xxx_GarageFinish_mul_GarageArea'|Yeo Johnson|Low|Power|Low|Original Values|High|
|'xxx_Total_living_area'|Yeo Johnson|Low|Log_e|Low|Power|Medium|
|'xxx_Age_Garage'|Yeo Johnson|None|Power|None|Original Values|Low|
|'xxx_Age_Build'|Power|None|Yeo Johnson|None|Original Values|Low|
|'xxx_Age_Remod'|Yeo Johnson|None|Power|None|Original Values|None|
|'xxx_Remod_TEST'|Yeo Johnson|None|Power|None|Original Values|Medium|
|'xxx_Has_2nd_floor'|Original Values|None| | | | |
|'xxx_Has_basement'|Original Values|Low| | | | |
|'xxx_Has_garage'|Original Values|Low| | | | |
|'xxx_Has_Masonry_Veneer'|Original Values|None| | | | |
|'xxx_Has_Enclosed_Porch'|Original Values|Low| | | | |
|'xxx_Has_Open_Porch'|Original Values|None| | | | |
|'xxx_Has_ANY_Porch'|Original Values|None| | | | |
|'xxx_Has_Wooden_Deck'|Original Values|Low| | | | |
|'xxx_TotalLivingArea'|Yeo Johnson|Low|Log_e|Low|Power|Medium|
|'xxx_TotalLivingArea_mul_OverallQual'|Yeo Johnson|Low|Power|Medium|Log_e|Low|
|'xxx_TotalLivingArea_mul_OverallCond'|Yeo Johnson|High|Log_e|High|Power|X High|
|'xxx_1stFlrSF_mul_OverallQual'|Yeo Johnson|Low|Power|High|Log_e|Medium|
|'xxx_2ndFlrSF_mul_OverallQual'|Yeo Johnson|None|Power|None|Original Values|Medium|

A Winsorizer will be applied to every feature, whether in its original values or transformed, where outliers are High or X High.
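
As a reminder of what IQR-based winsorizing does, here is a minimal pandas sketch using the same `fold=1.5` as the Winsorizer configured earlier in this notebook. The helper name `cap_iqr` and the toy values are assumptions for illustration.

```python
import pandas as pd


def cap_iqr(series, fold=1.5):
    """Clip values that fall outside [Q1 - fold * IQR, Q3 + fold * IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - fold * iqr, upper=q3 + fold * iqr)


# Toy data (hypothetical values) with one extreme observation
toy = pd.Series([1, 2, 3, 4, 100], name="LotArea")

capped = cap_iqr(toy)
print(capped.tolist())
```

In a real pipeline the bounds should be learned on the train set and reused unchanged for the validate and test sets, which is what fitting the Winsorizer once and calling `transform` afterwards achieves.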