# Notebook 04 - Feature Engineering

## Objectives

Engineer Features for:
* Classification
* Regression
* Clustering

## Inputs
* outputs/datasets/cleaned/train.parquet.gzip

## Outputs
* Create Clean dataset:
    * All new datasets produced by cleaning will be stored in inputs/datasets/cleaning
* Split the created dataset into 3 parts:
    * Train
    * Validate
    * Test
* All new datasets (train, validate and test) will be stored in outputs/datasets/cleaned

## Change working directory
In this section we get the location of the current directory and move one step up to the parent folder, so the app can access the project folder.

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os

current_dir = os.getcwd()
current_dir

'/Users/pecukevicius/DataspellProjects/heritage_houses_p5/jupyter_notebooks'

In [2]:
os.chdir(os.path.dirname(current_dir))
print('Current working directory is', os.getcwd())

Current working directory is /Users/pecukevicius/DataspellProjects/heritage_houses_p5


## Loading Dataset

In [3]:
import pandas as pd

df = pd.read_parquet('outputs/datasets/cleaned/train.parquet.gzip')
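# Drop the leftover index column if it is present; errors='ignore' avoids a KeyError when it is not
df.drop(columns=['Unnamed: 0'], inplace=True, errors='ignore')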
df.head()

Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,EnclosedPorch,GarageArea,GarageFinish,...,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,WoodDeckSF,YearBuilt,YearRemodAdd,SalePrice
618,1828,0,2,Av,48,Unf,1774,0,774,Unf,...,90,452,108,5,9,1822,0,2007,2007,314813
870,894,0,2,No,0,Unf,894,0,308,Unf,...,60,0,0,5,5,894,0,1962,1962,109500
92,964,0,2,No,713,ALQ,163,0,432,Unf,...,80,0,0,7,5,876,0,1921,2006,163500
817,1689,0,3,No,1218,GLQ,350,0,857,RFn,...,70,148,59,5,8,1568,0,2002,2002,271000
302,1541,0,3,No,0,Unf,1541,0,843,RFn,...,118,150,81,5,7,1541,0,2001,2002,205000


## Data Exploration

Hypothesis 2 also failed. It is possible that features interact with each other, so combining them may produce useful new features:
1. When we encode Basement Exposure and Finish Type, None becomes 0, which is fine because there is no basement.
2. When we encode Garage Finish, the same applies: None becomes 0 because there is no garage.
3. Kitchen Quality: Po (Poor) becomes 0, which is wrong. The encoded value may need to be a positive or negative number, because it can interact with other features:
    * If a certain area of the building is multiplied by it, we get 0. This needs deeper exploration.
4. We need to test how categorical variables can be combined with other features from the same category.
5. Overall Condition and Quality may affect the whole house, meaning all features except lot area, lot frontage and similar.
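
The concern in point 3 can be illustrated with a tiny frame. This is only a sketch: the rows are invented, and the zero-based mapping is a hypothetical encoding used for contrast. When the lowest real category is encoded as 0, any product with that code collapses to 0 regardless of the area.

```python
import pandas as pd

# Toy data: invented rows, not taken from the housing dataset
toy = pd.DataFrame({
    "KitchenQual": ["Po", "TA", "Ex"],
    "GrLivArea": [1000, 1000, 1000],
})

# Hypothetical encoding that starts at 0: 'Po' maps to 0
zero_based = {"Po": 0, "Fa": 1, "TA": 2, "Gd": 3, "Ex": 4}
# Encoding that reserves 0 for a missing feature: 'Po' maps to 1
none_based = {"None": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

toy["area_x_qual_zero_based"] = toy["KitchenQual"].map(zero_based) * toy["GrLivArea"]
toy["area_x_qual_none_based"] = toy["KitchenQual"].map(none_based) * toy["GrLivArea"]

print(toy)
```

With the zero-based mapping, a house with a poor kitchen gets the same product as a house with no living area at all, which is why 0 is better reserved for 'None'.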

## Creating new features and exploring correlations.
We will create more sub_features

## Feature Engineering

### Categorical Features Encoding

1. We will set an explicit category order for the encoder, so that encoded categorical features receive correct, or at least logical, numbers
2. We will add a second encoder, OneHotEncoder, so we can compare how each one increases or decreases model performance
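
A minimal sketch of the two encodings side by side on a toy column. It uses `pd.get_dummies` as a stand-in for one-hot encoding (an assumption made to keep the example short; it is not `OneHotEncoder` itself), and the sample rows are invented.

```python
import pandas as pd

# Toy column with invented rows; the category order follows the GarageFinish order used in this notebook
garage_order = ["None", "Unf", "RFn", "Fin"]
toy = pd.DataFrame({"GarageFinish": ["Unf", "Fin", "None", "RFn"]})

# Ordinal encoding: each category becomes its position in the ordered list
as_category = pd.Categorical(toy["GarageFinish"], categories=garage_order, ordered=True)
toy["GarageFinish_ordinal"] = as_category.codes

# One-hot encoding: one indicator column per category, no order implied
one_hot = pd.get_dummies(toy["GarageFinish"], prefix="GarageFinish", dtype=int)

print(toy.join(one_hot))
```

The ordinal column keeps the ranking in a single feature, while the one-hot columns drop the ranking and add one column per category, so the two can behave differently in a model.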

In [4]:
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

# Getting all categorical features as a list
categorical_features = df.select_dtypes(include=['object', 'category']).columns.tolist()

# Encoding order as specified.
# For Kitchen Quality we add 'None' first; otherwise 'Po' would be encoded as 0
order = {
    'BsmtExposure': ['None', 'No', 'Mn', 'Av', 'Gd'],
    'BsmtFinType1': ['None', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'],
    'GarageFinish': ['None', 'Unf', 'RFn', 'Fin'],
    'KitchenQual': ['None', 'Po', 'Fa', 'TA', 'Gd', 'Ex']
}

# Initialize the OrdinalEncoder with the specified order.
# The categories list is built from categorical_features so it always matches the column order
encoder = OrdinalEncoder(categories=[order[feature] for feature in categorical_features])

# Fit and transform the data
df[categorical_features] = encoder.fit_transform(df[categorical_features])

### Basement Features

First, we will create new sub_features using RelativeFeatures.

In [5]:
from feature_engine.creation import RelativeFeatures

basement_features = ['BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtUnfSF', 'TotalBsmtSF']
transformer = RelativeFeatures(
    variables=['BsmtFinSF1', 'BsmtUnfSF', 'TotalBsmtSF'],
    reference=['BsmtExposure', 'BsmtFinType1'],
    func=["sub", "mul", "add"],  # We will try to subtract, multiply and add - sum features
)
df_basement = transformer.fit_transform(df[basement_features])
df_basement.head()

Unnamed: 0,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtUnfSF,TotalBsmtSF,BsmtFinSF1_sub_BsmtExposure,BsmtUnfSF_sub_BsmtExposure,TotalBsmtSF_sub_BsmtExposure,BsmtFinSF1_sub_BsmtFinType1,BsmtUnfSF_sub_BsmtFinType1,...,TotalBsmtSF_mul_BsmtExposure,BsmtFinSF1_mul_BsmtFinType1,BsmtUnfSF_mul_BsmtFinType1,TotalBsmtSF_mul_BsmtFinType1,BsmtFinSF1_add_BsmtExposure,BsmtUnfSF_add_BsmtExposure,TotalBsmtSF_add_BsmtExposure,BsmtFinSF1_add_BsmtFinType1,BsmtUnfSF_add_BsmtFinType1,TotalBsmtSF_add_BsmtFinType1
618,3.0,1.0,48,1774,1822,45.0,1771.0,1819.0,47.0,1773.0,...,5466.0,48.0,1774.0,1822.0,51.0,1777.0,1825.0,49.0,1775.0,1823.0
870,1.0,1.0,0,894,894,-1.0,893.0,893.0,-1.0,893.0,...,894.0,0.0,894.0,894.0,1.0,895.0,895.0,1.0,895.0,895.0
92,1.0,5.0,713,163,876,712.0,162.0,875.0,708.0,158.0,...,876.0,3565.0,815.0,4380.0,714.0,164.0,877.0,718.0,168.0,881.0
817,1.0,6.0,1218,350,1568,1217.0,349.0,1567.0,1212.0,344.0,...,1568.0,7308.0,2100.0,9408.0,1219.0,351.0,1569.0,1224.0,356.0,1574.0
302,1.0,1.0,0,1541,1541,-1.0,1540.0,1540.0,-1.0,1540.0,...,1541.0,0.0,1541.0,1541.0,1.0,1542.0,1542.0,1.0,1542.0,1542.0


Now, using SmartCorrelatedSelection, we will identify sets of correlated features, so we do not need to work with all sub_features.

In [6]:
from feature_engine.selection import SmartCorrelatedSelection

tr = SmartCorrelatedSelection(
    variables=None,
    method="pearson",
    threshold=0.8,
    missing_values="raise",
    selection_method="variance",
    estimator=None,
)

tr.fit_transform(df_basement)

basement_feature_sets = tr.correlated_feature_sets_
basement_feature_sets

[{'BsmtExposure', 'TotalBsmtSF_mul_BsmtExposure'},
 {'BsmtFinType1', 'TotalBsmtSF_mul_BsmtFinType1'},
 {'BsmtFinSF1',
  'BsmtFinSF1_add_BsmtExposure',
  'BsmtFinSF1_add_BsmtFinType1',
  'BsmtFinSF1_mul_BsmtFinType1',
  'BsmtFinSF1_sub_BsmtExposure',
  'BsmtFinSF1_sub_BsmtFinType1'},
 {'BsmtUnfSF',
  'BsmtUnfSF_add_BsmtExposure',
  'BsmtUnfSF_add_BsmtFinType1',
  'BsmtUnfSF_sub_BsmtExposure',
  'BsmtUnfSF_sub_BsmtFinType1'},
 {'TotalBsmtSF',
  'TotalBsmtSF_add_BsmtExposure',
  'TotalBsmtSF_add_BsmtFinType1',
  'TotalBsmtSF_sub_BsmtExposure',
  'TotalBsmtSF_sub_BsmtFinType1'}]

We can now see the correlated sets. Based on them, we will select just the features we need.

In [7]:
selected_features = []

for feature_set in tr.correlated_feature_sets_:
    # Calculate variances within each set
    variances = {feature: df_basement[feature].var() for feature in feature_set}
    # Select the feature with the highest variance
    best_feature = max(variances, key=variances.get)
    selected_features.append(best_feature)

print("Selected features:", selected_features)


Selected features: ['TotalBsmtSF_mul_BsmtExposure', 'TotalBsmtSF_mul_BsmtFinType1', 'BsmtFinSF1_mul_BsmtFinType1', 'BsmtUnfSF_sub_BsmtFinType1', 'TotalBsmtSF_add_BsmtFinType1']


We can see that the best features and their combinations are:
1. TotalBsmtSF * BsmtExposure => Looks good and logical
2. TotalBsmtSF * BsmtFinType1 => Also logical
3. BsmtFinSF1 * BsmtFinType1 => Very logical
4. BsmtUnfSF - BsmtFinType1 => Doubtful: it is the unfinished area minus the finish type
5. TotalBsmtSF + BsmtFinType1 => Also not very logical
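
One possible reason the less logical combinations appear is that picking by highest variance depends on scale. The sketch below uses synthetic numbers (not the real data) to show that multiplying an area by a small ordinal code inflates the variance a lot, while adding or subtracting the code barely changes it, so the winner among `add`, `sub` and the original feature is close to arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins (not the real data): an area in square feet and an ordinal code 0..6
area = pd.Series(rng.uniform(0, 1500, size=500), name="area")
code = pd.Series(rng.integers(0, 7, size=500), name="code")

variances = pd.Series({
    "area": area.var(),
    "area_add_code": (area + code).var(),
    "area_sub_code": (area - code).var(),
    "area_mul_code": (area * code).var(),
})
print(variances.round(1))

# Adding the code changes the variance only marginally compared with the area alone
relative_change = (variances["area_add_code"] - variances["area"]) / variances["area"]
print(f"Relative variance change from adding the code: {relative_change:.4%}")
```

This suggests treating the variance-based pick as a shortlist and keeping only the combinations that also make domain sense.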

Let's check the correlations between the new sub_features and the sale price. We will build a dataframe where:
* Column - correlation coefficient
* Row - new sub_feature

In [8]:
import pandas as pd
from scipy.stats import spearmanr, kendalltau

sale_price = df['SalePrice']  # Extracting the SalePrice as a Series

# Initialize an empty DataFrame to store the correlations
correlation_df = pd.DataFrame(index=df_basement.columns, columns=['Pearson', 'Spearman', 'KendallTau'])

# Calculate Pearson correlations using pandas
correlation_df['Pearson'] = df_basement.corrwith(sale_price).dropna()

# Calculate Spearman and Kendall Tau correlations using scipy
for feature in df_basement.columns:
    # Ensure no NaN values interfere with correlation calculation
    clean_data = pd.concat([df_basement[feature], sale_price], axis=1).dropna()

    # Calculate Spearman and Kendall correlations
    spearman_corr, _ = spearmanr(clean_data[feature], clean_data['SalePrice'])
    kendall_corr, _ = kendalltau(clean_data[feature], clean_data['SalePrice'])

    # Store the results in the DataFrame
    correlation_df.at[feature, 'Spearman'] = spearman_corr
    correlation_df.at[feature, 'KendallTau'] = kendall_corr

correlation_df


Unnamed: 0,Pearson,Spearman,KendallTau
BsmtExposure,0.36336,0.334323,0.262249
BsmtFinType1,0.278139,0.331448,0.253472
BsmtFinSF1,0.392219,0.323839,0.238759
BsmtUnfSF,0.208668,0.169188,0.116486
TotalBsmtSF,0.635535,0.596989,0.429733
BsmtFinSF1_sub_BsmtExposure,0.391694,0.287856,0.211459
BsmtUnfSF_sub_BsmtExposure,0.207742,0.167175,0.114855
TotalBsmtSF_sub_BsmtExposure,0.63521,0.59667,0.429314
BsmtFinSF1_sub_BsmtFinType1,0.392129,0.307663,0.226797
BsmtUnfSF_sub_BsmtFinType1,0.206846,0.16619,0.114109


As expected, I doubt that creating sub_features for the basement is needed. It is logical, but I am not sure it helps.
If we do create new sub_features, I would use these:
1. BsmtFinType1 * BsmtFinSF1
2. BsmtExposure * TotalBsmtSF
3. BsmtUnfSF

We will try the ML model with this combination and once more without it, and see which works better.
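
The with/without comparison could be set up along these lines. This is a sketch on synthetic data using plain least squares from NumPy rather than the project's actual model; the target is built to depend on the product by construction, so the numbers say nothing about the real dataset and only show the shape of the comparison. The helper `holdout_r2` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 400

# Synthetic stand-ins: finished area, finish-type code, and a price that by
# construction depends on their product (an assumption of this toy example only)
fin_sf = rng.uniform(0, 1500, size=n)
fin_type = rng.integers(0, 7, size=n)
price = 100_000 + 10 * fin_sf * fin_type + rng.normal(0, 5_000, size=n)


def holdout_r2(features, target, n_train=300):
    """Fit ordinary least squares on the first n_train rows, return R^2 on the rest."""
    X = np.column_stack([np.ones(len(target))] + features)
    coef, *_ = np.linalg.lstsq(X[:n_train], target[:n_train], rcond=None)
    residuals = target[n_train:] - X[n_train:] @ coef
    total = target[n_train:] - target[n_train:].mean()
    return 1 - (residuals @ residuals) / (total @ total)


r2_without = holdout_r2([fin_sf, fin_type], price)
r2_with = holdout_r2([fin_sf, fin_type, fin_sf * fin_type], price)
print(f"R^2 without product feature: {r2_without:.3f}")
print(f"R^2 with product feature:    {r2_with:.3f}")
```

On the real data the same two-run comparison would use the chosen estimator and the train/validate split, keeping everything else identical between the runs.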


### Garage Features

I believe we will have the same issue with the garage, so we will create one more sub_feature:
GarageArea * GarageFinish. We will test the ML model with and without this sub_feature.

### Checking skewness, kurtosis and outliers, and whether scaling or normalization is needed

Sub_features to check:
1. BsmtFinType1 * BsmtFinSF1
2. BsmtExposure * TotalBsmtSF
3. GarageArea * GarageFinish

In [None]:
df['BsmtFinType1_BsmtFinSF1'] = df['BsmtFinType1'] * df['BsmtFinSF1']
df['BsmtExposure_TotalBsmtSF'] = df['BsmtExposure'] * df['TotalBsmtSF']
df['GarageArea_GarageFinish'] = df['GarageArea'] * df['GarageFinish']

In [None]:
df.columns.tolist()

In [None]:
def plot_dataframe(df, target):
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from scipy.stats import spearmanr, kendalltau, probplot

    # Configure plot settings
    save_plot = False  # Set to True if you wish to save the plots
    path = './plots'  # Directory to save the plots

    for col in df.columns:
        # Validate input types
        if not isinstance(df[col], pd.Series) or not isinstance(target, pd.Series):
            raise ValueError("Both feature and target must be pandas Series.")

        # Calculate correlation coefficients
        pearson_corr = df[col].corr(target, method='pearson')
        spearman_corr = spearmanr(df[col], target)[0]
        kendall_corr = kendalltau(df[col], target)[0]

        # Create the figure and axes (the figure title is set further below)
        fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(7, 10), gridspec_kw={"height_ratios": [1, 8, 8]})

        # Boxplot
        sns.boxplot(data=df, x=df[col].name, ax=axes[0])
        axes[0].set_title(f"{df[col].name} Boxplot")

        # Histogram with KDE, setting KDE curve color to red
        sns.histplot(data=df, x=df[col].name, kde=True, ax=axes[1], line_kws={'color': 'red', 'lw': 2})
        axes[1].set_title(f"{df[col].name} Distribution - Histogram")

        # Q-Q plot for normality
        probplot(df[col], dist="norm", plot=axes[2], fit=True)
        axes[2].set_title(f"{df[col].name} Q-Q Plot")

        # Setting the main title for the figure
        fig.suptitle(f"{df[col].name} Plot")

        # Calculating statistics
        mean = df[col].mean()
        median = df[col].median()
        mode = df[col].mode()[0] if not df[col].mode().empty else 'NA'
        IQR = df[col].quantile(0.75) - df[col].quantile(0.25)
        skewness = df[col].skew()
        kurtosis = df[col].kurt()
        outlier_range_min = df[col].quantile(0.25) - 1.5 * IQR
        outlier_range_max = df[col].quantile(0.75) + 1.5 * IQR

        # Annotations with different colors and transparency
        text_x = 0.95
        text_y = 0.95

        # Format any numeric value (including NumPy scalars such as an integer mode) to 2 decimals
        def fmt(value):
            return f"{float(value):.2f}" if pd.api.types.is_number(value) else 'N/A'

        stats_texts = (
            f"Skewness: {fmt(skewness)}\n"
            f"Kurtosis: {fmt(kurtosis)}\n"
            f"Mean: {fmt(mean)}\n"
            f"Median: {fmt(median)}\n"
            f"Mode: {fmt(mode)}\n"
            f"IQR: {fmt(IQR)}\n"
            f"Non-outlier range: [{fmt(outlier_range_min)}, {fmt(outlier_range_max)}]\n"
            f"Pearson: {fmt(pearson_corr)}\n"
            f"Spearman: {fmt(spearman_corr)}\n"
            f"Kendall-Tau: {fmt(kendall_corr)}"
        )

        # Place the text box on the histogram plot
        axes[1].text(text_x, text_y, stats_texts, transform=axes[1].transAxes, verticalalignment='top',
                     horizontalalignment='right', fontsize=10, bbox=dict(boxstyle="round,pad=0.5",
                                                                         facecolor='white', edgecolor='gray',
                                                                         alpha=0.9))

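        # Adjust the layout to leave room for the figure title
        plt.tight_layout(rect=[0, 0.03, 1, 0.95])
        # Save the plot with the feature name as the filename
        if save_plot:
            os.makedirs(path, exist_ok=True)  # make sure the target directory exists
            plt.savefig(os.path.join(path, f"{df[col].name}.png"))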

        plt.show()
        plt.close()



In [None]:
plot_dataframe(df[['BsmtFinType1_BsmtFinSF1', 'BsmtExposure_TotalBsmtSF', 'GarageArea_GarageFinish']],
               df['SalePrice'])

We can see that all of these new sub_features will also need transformations.

In [None]:
from feature_engine import transformation as vt
from feature_engine.outliers import Winsorizer
from feature_engine.encoding import OrdinalEncoder
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

sns.set(style="whitegrid")
import warnings

warnings.filterwarnings('ignore')


def FeatureEngineeringAnalysis(df, analysis_type=None, plot=False):
    """
    - Used for quick feature engineering on numerical and categorical variables
      to decide which transformation best improves the distribution shape
    - Once transformed, use a reporting tool, like pandas-profiling, to evaluate distributions
    """
    check_missing_values(df)
    allowed_types = ['numerical', 'ordinal_encoder', 'outlier_winsorizer']
    check_user_entry_on_analysis_type(analysis_type, allowed_types)
    list_column_transformers = define_list_column_transformers(analysis_type)

    # Loop in each variable and engineer the data according to the analysis type
    df_feat_eng = pd.DataFrame([])
    for column in df.columns:
        # create additional columns (column_method) to apply the methods
        df_feat_eng = pd.concat([df_feat_eng, df[column]], axis=1)
        for method in list_column_transformers:
            df_feat_eng[f"{column}_{method}"] = df[column]

        # Apply transformers to the respective column_transformers
        df_feat_eng, list_applied_transformers = apply_transformers(analysis_type, df_feat_eng, column)

        if plot:
            # For each variable, assess how the transformations perform
            transformer_evaluation(column, list_applied_transformers, analysis_type, df_feat_eng)

    return df_feat_eng


def check_user_entry_on_analysis_type(analysis_type, allowed_types):
    # Check analysis type
    if analysis_type is None:
        raise SystemExit(f"You should pass analysis_type parameter as one of the following options: {allowed_types}")
    if analysis_type not in allowed_types:
        raise SystemExit(f"analysis_type argument should be one of these options: {allowed_types}")


def check_missing_values(df):
    if df.isna().sum().sum() != 0:
        raise SystemExit(
            "There are missing values in your dataset. Please handle them before getting into feature engineering.")


def define_list_column_transformers(analysis_type):
    # Initialize an empty list for storing transformer names
    list_column_transformers = []

    # Numerical analysis transformers
    if analysis_type == 'numerical':
        list_column_transformers = ["log_e", "log_10", "reciprocal", "power", "box_cox", "yeo_johnson"]

    # Transformer for ordinal encoding
    elif analysis_type == 'ordinal_encoder':
        list_column_transformers = ["ordinal_encoder"]

    # Transformer for outlier handling via winsorization
    elif analysis_type == 'outlier_winsorizer':
        list_column_transformers = ['iqr']

    return list_column_transformers


def apply_transformers(analysis_type, df_feat_eng, column):
    for col in df_feat_eng.select_dtypes(include='category').columns:
        df_feat_eng[col] = df_feat_eng[col].astype('object')

    if analysis_type == 'numerical':
        df_feat_eng, list_applied_transformers = FeatEngineering_Numerical(df_feat_eng, column)

    elif analysis_type == 'outlier_winsorizer':
        df_feat_eng, list_applied_transformers = FeatEngineering_OutlierWinsorizer(df_feat_eng, column)

    elif analysis_type == 'ordinal_encoder':
        df_feat_eng, list_applied_transformers = FeatEngineering_CategoricalEncoder(df_feat_eng, column)

    elif analysis_type == 'onehot_encoder':
        df_feat_eng, list_applied_transformers = FeatEngineering_OneHotEncoder(df_feat_eng, column)

    return df_feat_eng, list_applied_transformers


def transformer_evaluation(column, list_applied_transformers, analysis_type, df_feat_eng):
    # For each variable, assess how the transformations perform
    print(f"* Variable Analyzed: {column}")
    print(f"* Applied transformation: {list_applied_transformers} \n")
    for col in [column] + list_applied_transformers:

        if analysis_type != 'ordinal_encoder':
            DiagnosticPlots_Numerical(df_feat_eng, col)

        else:
            if col == column:
                DiagnosticPlots_Categories(df_feat_eng, col)
            else:
                DiagnosticPlots_Numerical(df_feat_eng, col)

        print("\n")


def DiagnosticPlots_Categories(df_feat_eng, col):
    plt.figure(figsize=(4, 3))
    sns.countplot(data=df_feat_eng, x=col, palette=['#432371'], order=df_feat_eng[col].value_counts().index)
    plt.xticks(rotation=90)
    plt.suptitle(f"{col}", fontsize=30, y=1.05)
    plt.show()
    print("\n")


def DiagnosticPlots_Numerical(df, variable):
    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    sns.histplot(data=df, x=variable, kde=True, element="step", ax=axes[0])
    stats.probplot(df[variable], dist="norm", plot=axes[1])
    sns.boxplot(x=df[variable], ax=axes[2])

    axes[0].set_title('Histogram')
    axes[1].set_title('QQ Plot')
    axes[2].set_title('Boxplot')
    fig.suptitle(f"{variable}", fontsize=30, y=1.05)
    plt.tight_layout()
    plt.show()


def FeatEngineering_CategoricalEncoder(df_feat_eng, column):
    list_methods_worked = []
    try:
        encoder = OrdinalEncoder(encoding_method='arbitrary', variables=[f"{column}_ordinal_encoder"])
        df_feat_eng = encoder.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_ordinal_encoder")

    except Exception:
        df_feat_eng.drop([f"{column}_ordinal_encoder"], axis=1, inplace=True)

    return df_feat_eng, list_methods_worked


from sklearn.preprocessing import OneHotEncoder


def FeatEngineering_OneHotEncoder(df_feat_eng, column):
    list_methods_worked = []
    try:
        # Use the default sparse output and convert it with toarray(), so the code does not
        # depend on the `sparse` argument, which newer scikit-learn releases renamed
        encoder = OneHotEncoder(drop='first')
        encoded_cols = encoder.fit_transform(df_feat_eng[[column]]).toarray()
        # Keep the original index so that concat aligns the rows correctly
        encoded_df = pd.DataFrame(encoded_cols, columns=encoder.get_feature_names_out([column]),
                                  index=df_feat_eng.index)
        df_feat_eng = pd.concat([df_feat_eng, encoded_df], axis=1)
        # Report the columns that were actually created
        list_methods_worked.extend(encoded_df.columns.tolist())

    except Exception as e:
        print(f"Error occurred: {e}")

    return df_feat_eng, list_methods_worked


def FeatEngineering_OutlierWinsorizer(df_feat_eng, column):
    list_methods_worked = []

    ### Winsorizer iqr
    try:
        disc = Winsorizer(
            capping_method='iqr', tail='both', fold=1.5, variables=[f"{column}_iqr"])
        df_feat_eng = disc.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_iqr")
    except Exception:
        df_feat_eng.drop([f"{column}_iqr"], axis=1, inplace=True)

    return df_feat_eng, list_methods_worked


import numpy as np
import pandas as pd


def FeatEngineering_Numerical(df_feat_eng, column):
    list_methods_worked = []

    # LogTransformer base e
    try:
        lt = vt.LogTransformer(variables=[f"{column}_log_e"])
        df_feat_eng = lt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_log_e")
    except Exception:
        df_feat_eng.drop([f"{column}_log_e"], axis=1, inplace=True)

    # LogTransformer base 10
    try:
        lt = vt.LogTransformer(variables=[f"{column}_log_10"], base='10')
        df_feat_eng = lt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_log_10")
    except Exception:
        df_feat_eng.drop([f"{column}_log_10"], axis=1, inplace=True)

    # ReciprocalTransformer
    try:
        rt = vt.ReciprocalTransformer(variables=[f"{column}_reciprocal"])
        df_feat_eng = rt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_reciprocal")
    except Exception:
        df_feat_eng.drop([f"{column}_reciprocal"], axis=1, inplace=True)

    # PowerTransformer
    try:
        pt = vt.PowerTransformer(variables=[f"{column}_power"])
        df_feat_eng = pt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_power")
    except Exception:
        df_feat_eng.drop([f"{column}_power"], axis=1, inplace=True)

    # BoxCoxTransformer
    try:
        bct = vt.BoxCoxTransformer(variables=[f"{column}_box_cox"])
        df_feat_eng = bct.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_box_cox")
    except Exception:
        df_feat_eng.drop([f"{column}_box_cox"], axis=1, inplace=True)

    # YeoJohnsonTransformer
    try:
        yjt = vt.YeoJohnsonTransformer(variables=[f"{column}_yeo_johnson"])
        df_feat_eng = yjt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_yeo_johnson")
    except Exception:
        df_feat_eng.drop([f"{column}_yeo_johnson"], axis=1, inplace=True)

    return df_feat_eng, list_methods_worked

In [None]:
numerical_features = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
numerical_features

In [None]:
df = FeatureEngineeringAnalysis(df[numerical_features], analysis_type='numerical', plot=False)
plot_dataframe(df, df['SalePrice'])

## We are creating a ydata report file, because there are a lot of plots

### Transformations Summary.

1. 1stFlrSF - log_e has outliers; log_10 has outliers; box_cox has outliers; yeo_johnson has outliers
2. 2ndFlrSF - power no outliers; yeo_johnson no outliers, lowest skew 0.36
3. BedroomAbvGr - box_cox has outliers; yeo_johnson has outliers, skew 0.04
4. BsmtExposure - power has outliers; yeo_johnson skew -0.03, has outliers
5. BsmtFinSF1 - power no outliers, skew - 0.18
6. BsmtFinType1 - original values no outliers, skew 0.02
7. BsmtUnfSF - power has outliers, skew -0.23; yeo_johnson has outliers, skew -0.17
8. EnclosedPorch - yeo_johnson has the lowest skew, 7.88, and has outliers. Test removing outliers, then test in model, probably discard
9. GarageArea - original values have outliers, skew 0.17
10. GarageFinish - original values no outliers, skew 0.33; yeo_johnson no outliers, skew 0.03
11. GarageYrBlt - original values have outliers, skew -0.068, kurtosis -0.24; remove outliers, test in model, possibly discard
12. GrLivArea - log_e skew -0.07, has outliers; log_10 skew -0.07, has outliers; box_cox and yeo_johnson have outliers but skew = 0
13. KitchenQual - original values skew 0.40, no outliers, kurtosis -0.22; box_cox and yeo_johnson skew 0, no outliers, low kurtosis (test which is better in model)
14. LotArea - log_e skew -0.05, lots of outliers; box_cox and yeo_johnson skew 0.01, lots of outliers (test removing them, then test in model, possibly discard)
15. LotFrontage - power skew 0.10, lots of outliers (try removing outliers, then test in model, possibly discard)
16. MasVnrArea - yeo_johnson no outliers, skew 0.46
17. OpenPorchSF - yeo_johnson no outliers, skew -0.02
18. OverallCond - yeo_johnson the lowest skew 0.04, has outliers; box_cox has outliers, skew 0.09
19. OverallQual - original values skew 0.17, has outliers; box_cox skew 0.03, yeo_johnson skew 0.03; others have outliers
20. TotalBsmtSF - yeo_johnson skew 0.01 and lots of outliers
21. WoodDeckSF - yeo_johnson lowest skew 3.95, has outliers (test removing outliers, possibly discard in model)
22. YearBuilt - original values have outliers (test removing them)
23. YearRemodAdd - original values skew -0.49, no outliers
24. Target - SalePrice - original values skew 1.74, has outliers; log_e and log_10 have outliers, skew 0.03; box_cox and yeo_johnson skew 0, have outliers
25. BsmtFinType1_BsmtFinSF1 - power skew 0.16, has outliers
26. BsmtExposure_TotalBsmtSF - yeo_johnson lots of outliers, skew 0.22
27. GarageArea_GarageFinish - power skew -0.10, has outliers
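
The skew figures in this summary compare each column before and after a transformation. A minimal sketch of that comparison for Yeo-Johnson, using `scipy.stats.yeojohnson` on synthetic right-skewed values (an assumption: this is a stand-in for any skewed column and is not how the numbers above were produced):

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic right-skewed values standing in for a skewed column (not real data)
raw = pd.Series(rng.lognormal(mean=3, sigma=0.5, size=1000))

# scipy.stats.yeojohnson returns the transformed values and the fitted lambda
transformed, fitted_lambda = stats.yeojohnson(raw.to_numpy())
transformed = pd.Series(transformed)

print(f"Skew before: {raw.skew():.2f}")
print(f"Skew after:  {transformed.skew():.2f} (lambda = {fitted_lambda:.2f})")
```

A skew close to 0 after the transformation does not remove outliers by itself, which is why several entries above report a low skew together with remaining outliers.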

Let's try capping the outliers in the transformations above with a Winsorizer and see how it looks

In [None]:
df_outliers = df[
    ['1stFlrSF_log_e', '1stFlrSF_log_10', '1stFlrSF_box_cox', '1stFlrSF_yeo_johnson', 'BedroomAbvGr_box_cox',
     'BedroomAbvGr_yeo_johnson', 'BsmtExposure_power', 'BsmtExposure_yeo_johnson', 'BsmtFinSF1_power',
     'BsmtUnfSF_power', 'BsmtUnfSF_yeo_johnson', 'EnclosedPorch_yeo_johnson', 'GarageArea', 'GarageYrBlt',
     'GrLivArea_log_e', 'GrLivArea_log_10', 'GrLivArea_box_cox', 'GrLivArea_yeo_johnson', 'KitchenQual_box_cox',
     'KitchenQual_yeo_johnson', 'LotArea_log_e', 'LotArea_log_10', 'LotArea_box_cox', 'LotArea_yeo_johnson',
     'LotFrontage_power', 'OverallCond_box_cox', 'OverallCond_yeo_johnson', 'OverallQual_box_cox',
     'OverallQual_yeo_johnson', 'TotalBsmtSF_yeo_johnson', 'WoodDeckSF_yeo_johnson', 'YearBuilt', 'YearRemodAdd',
     'SalePrice', 'SalePrice_log_e', 'SalePrice_log_10', 'SalePrice_box_cox', 'SalePrice_yeo_johnson',
     'BsmtExposure_TotalBsmtSF_yeo_johnson', 'GarageArea_GarageFinish_power']]

In [None]:
df_outliers = FeatureEngineeringAnalysis(df=df_outliers, analysis_type='outlier_winsorizer', plot=False)

In [None]:
plot_dataframe(df_outliers, df['SalePrice'])
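
The IQR capping requested from `Winsorizer` above (`capping_method='iqr'`, `fold=1.5`) can be sketched in plain pandas. The example assumes the usual IQR rule (quartile plus or minus 1.5 times the IQR) and uses invented values; it illustrates the idea and is not the library's implementation.

```python
import pandas as pd

# Toy values with one obvious outlier (invented for illustration)
values = pd.Series([10, 12, 13, 14, 15, 16, 18, 95], name="toy_feature")

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside [lower, upper] are pulled back to the boundary, not dropped
capped = values.clip(lower=lower, upper=upper)

print(f"Bounds: [{lower:.3f}, {upper:.3f}]")
print(capped.tolist())
```

Because rows are capped rather than removed, the dataframe keeps its length, so the capped features stay aligned with `SalePrice`.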

## Outcome

We will be selecting these transformations:
1. Yeo Johnson: ['1stFlrSF', '2ndFlrSF', 'BedroomAbvGr', 'BsmtExposure', 'BsmtUnfSF', 'EnclosedPorch', 'GarageFinish', 'LotArea', 'MasVnrArea', 'OpenPorchSF', 'OverallCond', 'OverallQual', 'TotalBsmtSF', 'WoodDeckSF', 'BsmtExposure_TotalBsmtSF']
2. Power: ['BsmtFinSF1', 'LotFrontage', 'BsmtFinType1_BsmtFinSF1', 'GarageArea_GarageFinish']
3. Box Cox: ['GrLivArea']
4. Winsorizer test: ['1stFlrSF', 'BedroomAbvGr', 'BsmtExposure', 'BsmtUnfSF', 'GarageYrBlt', 'GrLivArea', 'OverallCond', 'OverallQual', 'YearBuilt', 'GarageArea_GarageFinish', 'TotalBsmtSF']
5. Winsorizer MUST: ['LotArea', 'LotFrontage', 'BsmtExposure_TotalBsmtSF']
6. Original Values: ['BsmtFinType1', 'GarageArea', 'GarageFinish', 'GarageYrBlt', 'KitchenQual', 'YearBuilt', 'YearRemodAdd']
7. Try removing Features: ['EnclosedPorch', 'LotFrontage']

The target also needs testing with these variations:

1. Original Values with Winsorizer
2. Log_e with Winsorizer
3. Box_Cox with Winsorizer
4. Yeo Johnson with Winsorizer
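
Whichever target variation is chosen, predictions made on the transformed scale have to be mapped back to prices. A small sketch for the Log_e case, using the five `SalePrice` values visible in the head of the training data shown earlier; here the "predictions" are simply the transformed values, an assumption made to keep the round trip visible.

```python
import numpy as np
import pandas as pd

# SalePrice values from the first five rows of the training data shown earlier
sale_price = pd.Series([314813, 109500, 163500, 271000, 205000], name="SalePrice")

# Transform the target before fitting a model ...
log_price = np.log(sale_price)

# ... and map values on the log scale back to the original price scale afterwards
restored = np.exp(log_price)

print(log_price.round(3).tolist())
print(restored.round(0).tolist())
```

Box-Cox and Yeo-Johnson targets need the same care, with the inverse computed from the lambda fitted on the training data.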


