# Notebook 04 - Feature Engineering

## Objectives

Engineer features for:
* Classification
* Regression
* Clustering

## Inputs
* outputs/datasets/cleaned/test.parquet.gzip

## Outputs
* Create clean dataset:
    * All new datasets produced by cleaning will be stored in inputs/datasets/cleaning
* Split the created dataset into 3 parts:
    * Train
    * Validate
    * Test
* All new datasets (train, validate and test) will be stored in outputs/datasets/cleaned

## Change working directory
In this section we get the location of the current directory and move one step up to the parent folder, so that the app can access the project folder.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [1]:
import os

current_dir = os.getcwd()
current_dir

'/Users/pecukevicius/DataspellProjects/heritage_houses_p5/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("you have set a new current directory")

you have set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/Users/pecukevicius/DataspellProjects/heritage_houses_p5'

## Loading Dataset

In [4]:
import pandas as pd

df = pd.read_parquet('outputs/datasets/cleaned/test.parquet.gzip')
df.head()

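| Unnamed: 0.1 | Unnamed: 0 | 1stFlrSF | 2ndFlrSF | BedroomAbvGr | BsmtExposure | BsmtFinSF1 | BsmtFinType1 | BsmtUnfSF | EnclosedPorch | GarageArea | ... | LotFrontage | MasVnrArea | OpenPorchSF | OverallCond | OverallQual | TotalBsmtSF | WoodDeckSF | YearBuilt | YearRemodAdd | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 662 | 1392 | 0 | 2 | 3 | 0 | 6 | 1392 | 0 | 576 | ... | 120 | 216 | 0 | 3 | 6 | 1392 | 0 | 1968 | 1968 | 110000 |
| 1 | 1187 | 1624 | 0 | 2 | 3 | 1456 | 2 | 168 | 0 | 757 | ... | 89 | 0 | 114 | 5 | 8 | 1624 | 0 | 1994 | 1995 | 262000 |
| 2 | 1305 | 1652 | 0 | 2 | 3 | 1572 | 2 | 80 | 0 | 840 | ... | 108 | 300 | 102 | 5 | 9 | 1652 | 0 | 2006 | 2007 | 325000 |
| 3 | 945 | 1188 | 561 | 3 | 3 | 1088 | 3 | 0 | 244 | 456 | ... | 98 | 0 | 0 | 6 | 5 | 1088 | 0 | 1890 | 1996 | 124900 |
| 4 | 269 | 1113 | 0 | 3 | 3 | 751 | 1 | 392 | 0 | 504 | ... | 70 | 174 | 30 | 7 | 6 | 1143 | 0 | 1976 | 1976 | 148000 |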
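
The preview shows two columns named `Unnamed: 0.1` and `Unnamed: 0`. Columns like these usually appear when a DataFrame index is written to a file and read back. Assuming they are leftover index columns with no predictive value (an assumption, since the notebook does not say so), a minimal sketch for dropping them could look like this. The helper name `drop_unnamed_columns` and the toy frame are hypothetical.

```python
import pandas as pd


def drop_unnamed_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy without columns whose name starts with 'Unnamed'."""
    keep = [c for c in frame.columns if not str(c).startswith("Unnamed")]
    return frame[keep].copy()


# Toy frame mimicking a few columns of the preview above (not the real dataset)
toy = pd.DataFrame({
    "Unnamed: 0.1": [0, 1],
    "Unnamed: 0": [662, 1187],
    "1stFlrSF": [1392, 1624],
    "SalePrice": [110000, 262000],
})
cleaned = drop_unnamed_columns(toy)
print(list(cleaned.columns))
```

Whether to drop these columns depends on how the cleaned files were written, so it is worth checking the cleaning notebook before applying this to `df`.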


## Data Exploration
As decided earlier, we will work with a selected set of features. Before transforming them, we explore the data with a profiling report:

In [None]:
from ydata_profiling import ProfileReport

pandas_report = ProfileReport(df, minimal=True)
pandas_report.to_notebook_iframe()

## Features Engineering

### Functions for transforming

We have added an extra feature: results.
They are printed for each analysis, and `FeatureEngineeringAnalysis` also returns them at the end, so we can analyze them more easily.

In [None]:
%matplotlib inline
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats as stats
import seaborn as sns
from feature_engine.encoding import OrdinalEncoder
from feature_engine.outliers import Winsorizer
from scipy.stats import skew, kurtosis, shapiro
from sklearn.preprocessing import PowerTransformer

sns.set(style="whitegrid")
warnings.filterwarnings('ignore')


def FeatureEngineeringAnalysis(df, analysis_type=None):
    """
    - used for quick feature engineering on numerical and categorical variables
    to decide which transformation can better transform the distribution shape
    - Once transformed, use a reporting tool, like ydata-profiling, to evaluate distributions
    """
    check_missing_values(df)
    allowed_types = ['numerical', 'ordinal_encoder', 'outlier_winsorizer']
    check_user_entry_on_analysis_type(analysis_type, allowed_types)
    list_column_transformers = define_list_column_transformers(analysis_type)

    # Loop in each variable and engineer the data according to the analysis type
    df_feat_eng = pd.DataFrame([])
    for column in df.columns:
        # create additional columns (column_method) to apply the methods
        df_feat_eng = pd.concat([df_feat_eng, df[column]], axis=1)
        for method in list_column_transformers:
            df_feat_eng[f"{column}_{method}"] = df[column]

        # Apply transformers in respective column_transformers
        df_feat_eng, list_applied_transformers = apply_transformers(
            analysis_type, df_feat_eng, column)

        # For each variable, assess how the transformations perform
        transformer_evaluation(
            column, list_applied_transformers, analysis_type, df_feat_eng)

    return df_feat_eng


def check_user_entry_on_analysis_type(analysis_type, allowed_types):
    """ Check analysis type """
    # ValueError reports bad input without asking the interpreter to exit
    if analysis_type is None:
        raise ValueError(
            f"You should pass analysis_type parameter as one of the following options: {allowed_types}")
    if analysis_type not in allowed_types:
        raise ValueError(
            f"analysis_type argument should be one of these options: {allowed_types}")


def check_missing_values(df):
    """ Stop early if the dataset still contains missing values """
    if df.isna().sum().sum() != 0:
        raise ValueError(
            "There is a missing value in your dataset. Please handle that before getting into feature engineering.")


def define_list_column_transformers(analysis_type):
    """ Set suffix columns according to analysis_type"""
    list_column_transformers = []
    if analysis_type == 'numerical':
        list_column_transformers = [
            "log_e", "log_10", "reciprocal", "power", "box_cox", "yeo_johnson"]

    elif analysis_type == 'ordinal_encoder':
        list_column_transformers = ["ordinal_encoder"]

    elif analysis_type == 'outlier_winsorizer':
        list_column_transformers = ['iqr']

    return list_column_transformers


def apply_transformers(analysis_type, df_feat_eng, column):
    list_applied_transformers=[]
    for col in df_feat_eng.select_dtypes(include='category').columns:
        df_feat_eng[col] = df_feat_eng[col].astype('object')

    if analysis_type == 'numerical':
        df_feat_eng, list_applied_transformers = FeatEngineering_Numerical(
            df_feat_eng, column)

    elif analysis_type == 'outlier_winsorizer':
        df_feat_eng, list_applied_transformers = FeatEngineering_OutlierWinsorizer(
            df_feat_eng, column)

    elif analysis_type == 'ordinal_encoder':
        df_feat_eng, list_applied_transformers = FeatEngineering_CategoricalEncoder(
            df_feat_eng, column)

    return df_feat_eng, list_applied_transformers


def transformer_evaluation(column, list_applied_transformers, analysis_type, df_feat_eng):
    # For each variable, assess how the transformations perform
    print(f"* Variable Analyzed: {column}")
    print(f"* Applied transformation: {list_applied_transformers} \n")

    for col in [column] + list_applied_transformers:
        if analysis_type != 'ordinal_encoder':
            DiagnosticPlots_Numerical(df_feat_eng, col)
        else:
            # The original categorical column gets a count plot,
            # its encoded version gets the numerical plots
            if col == column:
                DiagnosticPlots_Categories(df_feat_eng, col)
            else:
                DiagnosticPlots_Numerical(df_feat_eng, col)

        print("\n")


def DiagnosticPlots_Categories(df_feat_eng, col):
    plt.figure(figsize=(4, 3))
    # A single `color` avoids seaborn's warning about `palette` without `hue`
    sns.countplot(data=df_feat_eng, x=col, color='#432371',
                  order=df_feat_eng[col].value_counts().index)
    plt.xticks(rotation=90)
    plt.suptitle(f"{col}", fontsize=30, y=1.05)
    plt.show()
    print("\n")


import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

def DiagnosticPlots_Numerical(df, variable):
    """
    Generate diagnostic plots for a numerical variable including histogram, QQ plot, and boxplot.
    Additionally, calculate and display skewness, kurtosis, and Shapiro-Wilk test p-value.

    Parameters:
        df (DataFrame): The DataFrame containing the variable.
        variable (str): The name of the variable to analyze.
    """
    # Calculate statistics
    skewness = skew(df[variable].dropna())  # Skewness
    kurtosis_value = kurtosis(df[variable].dropna())  # Kurtosis
    _, p_value = shapiro(df[variable].dropna())  # Shapiro-Wilk test

    # Create plots
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    sns.histplot(data=df, x=variable, kde=True, element="step", ax=axes[0])
    stats.probplot(df[variable], dist="norm", plot=axes[1])
    sns.boxplot(x=df[variable], ax=axes[2])

    # Set titles including statistics
    axes[0].set_title(f'Histogram\nSkewness: {skewness:.3f}')
    axes[1].set_title('QQ Plot')
    axes[2].set_title('Boxplot')

    # Display Shapiro-Wilk test result on figure
    fig.suptitle(f"{variable} (Skewness: {skewness:.3f}, Kurtosis: {kurtosis_value:.3f}, Shapiro-Wilk p-value: {p_value:.3g})", fontsize=16, y=1.05)

    plt.tight_layout()
    plt.show()




def FeatEngineering_CategoricalEncoder(df_feat_eng, column):
    list_methods_worked = []
    try:
        encoder = OrdinalEncoder(encoding_method='arbitrary', variables=[
            f"{column}_ordinal_encoder"])
        df_feat_eng = encoder.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_ordinal_encoder")

    except Exception:
        df_feat_eng.drop([f"{column}_ordinal_encoder"], axis=1, inplace=True)

    return df_feat_eng, list_methods_worked


def FeatEngineering_OutlierWinsorizer(df_feat_eng, column):
    list_methods_worked = []

    # Winsorizer iqr
    try:
        disc = Winsorizer(
            capping_method='iqr', tail='both', fold=1.5, variables=[f"{column}_iqr"])
        df_feat_eng = disc.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_iqr")
    except Exception:
        df_feat_eng.drop([f"{column}_iqr"], axis=1, inplace=True)

    return df_feat_eng, list_methods_worked


import numpy as np
from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler, RobustScaler, QuantileTransformer, FunctionTransformer
from feature_engine.transformation import LogTransformer, ReciprocalTransformer, BoxCoxTransformer, \
    YeoJohnsonTransformer


def FeatEngineering_Numerical(df_feat_eng, column):
    """
    Applies various feature engineering transformations to a specified column in the dataframe.

    Parameters:
        df_feat_eng (DataFrame): The dataframe containing the features for transformation.
        column (str): The specific column on which to apply the transformations.

    Returns:
        tuple: The transformed dataframe and a list of successfully applied transformation methods.
    """
    list_methods_worked = []

    # Each transformer targets the suffixed copy of the column, so the
    # original values stay untouched for comparison.

    # LogTransformer base e
    try:
        lt = LogTransformer(variables=[f"{column}_log_e"])
        df_feat_eng = lt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_log_e")
    except Exception:
        df_feat_eng.drop([f"{column}_log_e"], axis=1, inplace=True)

    # LogTransformer base 10 (feature-engine expects the base as a string)
    try:
        lt = LogTransformer(variables=[f"{column}_log_10"], base='10')
        df_feat_eng = lt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_log_10")
    except Exception:
        df_feat_eng.drop([f"{column}_log_10"], axis=1, inplace=True)

    # ReciprocalTransformer
    try:
        rt = ReciprocalTransformer(variables=[f"{column}_reciprocal"])
        df_feat_eng = rt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_reciprocal")
    except Exception:
        df_feat_eng.drop([f"{column}_reciprocal"], axis=1, inplace=True)

    # sklearn PowerTransformer (Yeo-Johnson, output standardized by default)
    try:
        pt = PowerTransformer(method='yeo-johnson')
        df_feat_eng[f"{column}_power"] = pt.fit_transform(
            df_feat_eng[[f"{column}_power"]]).ravel()
        list_methods_worked.append(f"{column}_power")
    except Exception:
        df_feat_eng.drop([f"{column}_power"], axis=1, inplace=True)

    # BoxCoxTransformer
    try:
        bct = BoxCoxTransformer(variables=[f"{column}_box_cox"])
        df_feat_eng = bct.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_box_cox")
    except Exception:
        df_feat_eng.drop([f"{column}_box_cox"], axis=1, inplace=True)

    # YeoJohnsonTransformer
    try:
        yjt = YeoJohnsonTransformer(variables=[f"{column}_yeo_johnson"])
        df_feat_eng = yjt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_yeo_johnson")
    except Exception:
        df_feat_eng.drop([f"{column}_yeo_johnson"], axis=1, inplace=True)

    return df_feat_eng, list_methods_worked
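
`FeatEngineering_OutlierWinsorizer` relies on feature-engine's `Winsorizer` with `capping_method='iqr'` and `fold=1.5`. As a rough illustration of the rule behind it, the sketch below caps a toy series at Q1 - 1.5 * IQR and Q3 + 1.5 * IQR using plain pandas. The helper `cap_iqr` and the toy values are hypothetical, and the library's quantile handling may differ in detail.

```python
import pandas as pd


def cap_iqr(values: pd.Series, fold: float = 1.5) -> pd.Series:
    """Clip values to the interval [Q1 - fold*IQR, Q3 + fold*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return values.clip(lower=q1 - fold * iqr, upper=q3 + fold * iqr)


# Toy areas, where 3000 is an obvious outlier (not the project data)
toy_area = pd.Series([400, 450, 500, 550, 600, 3000])
capped = cap_iqr(toy_area)
print(capped.tolist())
```

Only the extreme value is pulled back to the upper fence; values already inside the fences are left unchanged, which is why the boxplot of an `_iqr` column shows no points beyond the whiskers.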



## Inspecting dataframe for columns that might not need to be transformed


In [None]:
df_transformed = FeatureEngineeringAnalysis(df=df, analysis_type='numerical')
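
To make concrete what the `numerical` analysis compares, the sketch below measures absolute skewness before and after a few of the same families of transformations. It uses synthetic right-skewed data and SciPy's `boxcox` and `yeojohnson` functions as stand-ins for the feature-engine transformers, so it illustrates the idea and is not a reproduction of the notebook's results.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic, strictly positive, right-skewed values (not the project data)
raw = rng.lognormal(mean=0.0, sigma=0.5, size=1000)

candidates = {
    "original": raw,
    "log_e": np.log(raw),
    "box_cox": stats.boxcox(raw)[0],          # needs strictly positive input
    "yeo_johnson": stats.yeojohnson(raw)[0],  # also accepts zeros and negatives
}

# Absolute skewness closer to 0 means a more symmetric distribution
abs_skew = {name: abs(stats.skew(vals)) for name, vals in candidates.items()}
for name, value in sorted(abs_skew.items(), key=lambda kv: kv[1]):
    print(f"{name:12s} |skew| = {value:.3f}")
```

Low skewness alone does not guarantee a better model, so the histogram, QQ plot and boxplot are still worth reading alongside the numbers.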

### Based on the plots above, we can select the best transformations for each feature:
1. TotalBsmtSF:
    * Quantile Uniform
    * Sine
    * Cosine
2. 1stFlrSF:
    * Original Values
    * Log-e
    * Box Cox
    * Min Max
    * Quantile Normal
3. YearBuilt:
    * Quantile Uniform
    * Yeo Johnson
    * Power
    * Min Max
4. GarageArea:
    * Quantile Uniform
    * Sine
    * Cosine
    * Yeo Johnson
5. GarageYrBlt:
    * Quantile Uniform
    * Quantile Normal
    * Cosine
6. OverallQual:
    * Power
    * Original Values
    * Min Max
    * Max Abs
7. GrLivArea:
    * Quantile Normal
    * Original Values
    * Max Abs
8. SalePrice:
    * Original Values
    * Power
    * Min Max
    * Quantile Normal

## Printing Transformation results

We will collect the results in a DataFrame to get a short summary of all applied transformations, which we can then analyze.

In [None]:
import numpy as np
import pandas as pd
import scipy.stats as stats

def calculate_statistics(df):
    """
    Calculate skewness and kurtosis for each numeric feature.
    Also, calculate mean, median, standard deviation, minimum, and maximum.

    Parameters:
        df (DataFrame): The DataFrame with numeric features.

    Returns:
        DataFrame: A new DataFrame with each row representing a feature and its statistics.
    """
    stats_list = []  # Use a list to collect all row data

    for column in df.select_dtypes(include=[np.number]).columns:
        mean_val = df[column].mean()
        median_val = df[column].median()
        std_dev = df[column].std()
        min_val = df[column].min()
        max_val = df[column].max()
        skewness = stats.skew(df[column], nan_policy='omit')
        kurtosis_val = stats.kurtosis(df[column], nan_policy='omit')

        # Append row to list
        stats_list.append({
            'Feature': column,
            'Mean': mean_val,
            'Median': median_val,
            'Std Dev': std_dev,
            'Min': min_val,
            'Max': max_val,
            'Skewness': skewness,
            'Kurtosis': kurtosis_val,
        })

    # Convert list of dictionaries to DataFrame
    stats_df = pd.DataFrame(stats_list)
    return stats_df

statistics_df = calculate_statistics(df_transformed)

In [None]:
statistics_df.head()
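
The diagnostic plots report a Shapiro-Wilk p-value, while the summary table built by `calculate_statistics` contains no such column. If that value is wanted in the table as well, one possible approach is sketched below. The helper `shapiro_p_values` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats


def shapiro_p_values(frame: pd.DataFrame) -> pd.Series:
    """Shapiro-Wilk p-value for every numeric column (NaNs dropped)."""
    numeric = frame.select_dtypes(include=[np.number])
    return numeric.apply(lambda col: stats.shapiro(col.dropna())[1])


rng = np.random.default_rng(42)
# Synthetic columns for illustration only
toy = pd.DataFrame({
    "normal_like": rng.normal(size=200),
    "skewed": rng.exponential(size=200),
})
p_values = shapiro_p_values(toy)
print(p_values)
```

A small p-value suggests a departure from normality. With large samples the test tends to reject even minor deviations, so it is best read together with skewness and the QQ plot.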

### Now we will write code that returns the 3 best transformations for each feature

In [None]:
def select_best_transformations(df, n_transformations=3):
    """
    Select the best transformations for each feature based on the lowest absolute skewness.

    Parameters:
        df (DataFrame): A DataFrame with columns 'Feature', 'Skewness', 'Kurtosis',
                        and possibly others.
        n_transformations (int): The number of top transformations to select for each feature.

    Returns:
        DataFrame: A DataFrame with the top transformations for each feature.
    """
    df = df.copy()  # Avoid mutating the caller's DataFrame

    # Split 'Feature' into 'Base Feature' and 'Transformation Method'
    df[['Base Feature', 'Transformation Method']] = df['Feature'].str.split('_', n=1, expand=True)
    # Features without an underscore are the untransformed originals
    df['Transformation Method'] = df['Transformation Method'].fillna('Original')

    # Create a new column for absolute skewness to sort by it
    df['Abs Skewness'] = df['Skewness'].abs()

    # Sort the DataFrame by absolute skewness
    sorted_df = df.sort_values(by=['Abs Skewness'], ascending=[True])

    # Group by the base feature and select the top transformations
    top_transformations = sorted_df.groupby('Base Feature').head(n_transformations).reset_index(drop=True)

    # Return the DataFrame sorted by 'Base Feature'
    return top_transformations.sort_values(by='Base Feature')[['Base Feature', 'Skewness', 'Kurtosis', 'Transformation Method']]

best_transforms = select_best_transformations(statistics_df)


In [None]:
best_transforms
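
The ranking logic is easier to follow on a tiny table. The sketch below repeats the same steps (split the name on the first underscore, sort by absolute skewness, keep the top rows per base feature) on made-up skewness values. The numbers are hypothetical and only illustrate the mechanics.

```python
import pandas as pd

# Hypothetical skewness values, for illustration only
toy_stats = pd.DataFrame({
    "Feature": ["GrLivArea", "GrLivArea_log_e", "GrLivArea_yeo_johnson",
                "GrLivArea_reciprocal"],
    "Skewness": [1.30, -0.05, 0.01, -2.10],
})

# Split on the first underscore only, so 'yeo_johnson' stays in one piece
parts = toy_stats["Feature"].str.split("_", n=1, expand=True)
toy_stats["Base Feature"] = parts[0]
toy_stats["Transformation Method"] = parts[1].fillna("Original")

# Rank by absolute skewness and keep the two best candidates per feature
toy_stats["Abs Skewness"] = toy_stats["Skewness"].abs()
top2 = (toy_stats.sort_values("Abs Skewness")
        .groupby("Base Feature")
        .head(2))
print(top2[["Base Feature", "Transformation Method", "Skewness"]])
```

One caveat of this naming scheme: a base feature whose own name contains an underscore would be split in the wrong place, so such columns would need a different separator.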

### Based on the selection above, we add any transformations that do not exist in the previous list:
1. TotalBsmtSF:
    * Quantile Uniform
    * Sine
    * Cosine
2. 1stFlrSF:
    * Original Values
    * Log-e
    * Box Cox
    * Min Max
    * Quantile Normal
    * Quantile Uniform
    * Max Abs
3. YearBuilt:
    * Quantile Uniform
    * Yeo Johnson
    * Power
    * Min Max
    * Reciprocal
4. GarageArea:
    * Quantile Uniform
    * Sine
    * Cosine
    * Yeo Johnson
    * Original Values
5. GarageYrBlt:
    * Quantile Uniform
    * Quantile Normal
    * Cosine
6. OverallQual:
    * Power
    * Original Values
    * Min Max
    * Max Abs
    * Robust
7. GrLivArea:
    * Quantile Normal
    * Original Values
    * Max Abs
    * Quantile Uniform
8. SalePrice:
    * Original Values
    * Power
    * Min Max
    * Quantile Normal
    * Quantile Uniform

## Conclusions:

We will add these transformations to the ML pipeline:

- **Original Values** : [`1stFlrSF`, `GarageArea`, `OverallQual`, `GrLivArea`, `SalePrice`]
- **Quantile Uniform** : [`TotalBsmtSF`, `1stFlrSF`, `YearBuilt`, `GarageArea`, `GarageYrBlt`, `GrLivArea`, `SalePrice`]
- **Sine** : [`TotalBsmtSF`, `GarageArea`]
- **Cosine** : [`TotalBsmtSF`, `GarageArea`, `GarageYrBlt`]
- **Log-e** : [`1stFlrSF`]
- **Box Cox** : [`1stFlrSF`]
- **Min Max** : [`1stFlrSF`, `YearBuilt`, `OverallQual`, `SalePrice`]
- **Quantile Normal** : [`1stFlrSF`, `GarageYrBlt`, `GrLivArea`, `SalePrice`]
- **Max Abs** : [`1stFlrSF`, `OverallQual`, `GrLivArea`]
- **Power** : [`YearBuilt`, `OverallQual`, `SalePrice`]
- **Yeo Johnson** : [`YearBuilt`, `GarageArea`]
- **Reciprocal** : [`YearBuilt`]
- **Robust** : [`OverallQual`]
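
For the modelling notebook it can help to keep this plan as data instead of prose. The sketch below stores the mapping above in a dictionary and inverts it into a per-feature view, which gives a quick cross-check against the per-feature lists. The variable names `pipeline_plan` and `per_feature` are hypothetical.

```python
from collections import defaultdict

# Transformation -> features, copied from the list above
pipeline_plan = {
    "Original Values": ["1stFlrSF", "GarageArea", "OverallQual", "GrLivArea", "SalePrice"],
    "Quantile Uniform": ["TotalBsmtSF", "1stFlrSF", "YearBuilt", "GarageArea",
                         "GarageYrBlt", "GrLivArea", "SalePrice"],
    "Sine": ["TotalBsmtSF", "GarageArea"],
    "Cosine": ["TotalBsmtSF", "GarageArea", "GarageYrBlt"],
    "Log-e": ["1stFlrSF"],
    "Box Cox": ["1stFlrSF"],
    "Min Max": ["1stFlrSF", "YearBuilt", "OverallQual", "SalePrice"],
    "Quantile Normal": ["1stFlrSF", "GarageYrBlt", "GrLivArea", "SalePrice"],
    "Max Abs": ["1stFlrSF", "OverallQual", "GrLivArea"],
    "Power": ["YearBuilt", "OverallQual", "SalePrice"],
    "Yeo Johnson": ["YearBuilt", "GarageArea"],
    "Reciprocal": ["YearBuilt"],
    "Robust": ["OverallQual"],
}

# Invert to feature -> transformations for a per-feature view
per_feature = defaultdict(list)
for transformation, features in pipeline_plan.items():
    for feature in features:
        per_feature[feature].append(transformation)

for feature, transformations in per_feature.items():
    print(f"{feature}: {transformations}")
```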

## Next step is ML model creation and its evaluation