# Notebook 04 - Feature Engineering

## Objectives

Engineer Features for:
* Classification
* Regression
* Clustering

## Inputs
* outputs/datasets/cleaned/train.parquet.gzip

## Outcome:

Selected features and the transformations to apply to them

## Change working directory
In this section we get the location of the current directory and move one level up, to the parent folder, so that the app can access the project folder.

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os

current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("you have set a new current directory")

Confirm new current directory

In [None]:
current_dir = os.getcwd()
current_dir

## Loading Dataset

In [None]:
import pandas as pd

df_train = pd.read_parquet('outputs/datasets/cleaned/train.parquet.gzip')
df_train.head()

## Data Exploration
Before exploring the data and applying transformations, we select the features we decided on earlier:

In [None]:
hypothesis_1_features = [
    "BsmtFinType1", "KitchenQual", "OverallQual", "GarageFinish", "BsmtExposure",
    "GrLivArea", "GarageArea", "YearBuilt", "1stFlrSF", "TotalBsmtSF", "SalePrice",
]
# Keep only the selected features in the DataFrame
df_train = df_train[hypothesis_1_features]
df_train.head()

## Features Engineering

### Functions for transforming

We have added an extra feature: results.
It is printed on each analysis and is also returned at the end of FeatureEngineeringAnalysis, so we can analyze it a bit more easily.

In [None]:
from feature_engine.encoding import OrdinalEncoder

def feat_engineering_categorical_encoder(df_feat_eng):
    """
    Applies ordinal encoding to all categorical columns in the DataFrame.
    
    Parameters:
        df_feat_eng (pd.DataFrame): The DataFrame to transform.
    
    Returns:
        pd.DataFrame: The transformed DataFrame.
    """
    # Detect categorical columns in the DataFrame
    categorical_columns = df_feat_eng.select_dtypes(include=['object', 'category']).columns.tolist()

    # Apply ordinal encoding to each categorical column
    encoder = OrdinalEncoder(encoding_method='arbitrary', variables=categorical_columns)
    try:
        df_feat_eng = encoder.fit_transform(df_feat_eng)
    except Exception as e:
        print(f"Error encoding columns {categorical_columns}: {e}")
        # In case of failure, drop the columns intended for encoding;
        # reassign instead of using inplace=True so a slice passed by the caller is not mutated
        df_feat_eng = df_feat_eng.drop(columns=categorical_columns)

    return df_feat_eng

In [None]:
from feature_engine import transformation as vt

def apply_transformation(transformer, df, column_name):
    """
    Applies a given transformer to the DataFrame and handles exceptions.
    
    Parameters:
        transformer (vt.BaseNumericalTransformer): The transformer to apply.
        df (pd.DataFrame): The DataFrame to transform.
        column_name (str): The name of the column to transform.
    
    Returns:
        pd.DataFrame: The transformed DataFrame.
        bool: Whether the transformation was successful.
    """
    try:
        df = transformer.fit_transform(df)
        return df, True
    except Exception as e:
        if column_name in df.columns:
            df.drop([column_name], axis=1, inplace=True)
        return df, False

def feat_engineering_numerical(df_feat_eng, columns=None):
    """
    Applies various numerical transformations to given columns or the entire DataFrame.
    
    Parameters:
        df_feat_eng (pd.DataFrame): The DataFrame to transform.
        columns (list or None): The list of columns to transform. If None, transforms all numerical columns.
    
    Returns:
        pd.DataFrame: The transformed DataFrame.
    """
    if columns is None:
        # If no columns are specified, transform all numerical columns
        columns = df_feat_eng.select_dtypes(include='number').columns.tolist()

    # Define transformations and their corresponding column suffixes.
    # Each value is a callable that accepts the `variables` keyword.
    transformations = {
        "log_e": vt.LogTransformer,
        "log_10": lambda **kwargs: vt.LogTransformer(base='10', **kwargs),
        "reciprocal": vt.ReciprocalTransformer,
        "power": vt.PowerTransformer,
        "box_cox": vt.BoxCoxTransformer,
        "yeo_johnson": vt.YeoJohnsonTransformer
    }

    # Apply each transformation to a copy of each column
    for column in columns:
        for suffix, transformer_class in transformations.items():
            column_name = f"{column}_{suffix}"
            # The suffixed column must exist before a transformer can target it
            df_feat_eng[column_name] = df_feat_eng[column]
            transformer = transformer_class(variables=[column_name])
            df_feat_eng, _ = apply_transformation(transformer, df_feat_eng, column_name)

    return df_feat_eng
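
For intuition, the six transformations named above can be reproduced on a single array with NumPy and SciPy. This is only a sketch on synthetic data: it uses SciPy as a stand-in rather than feature_engine, and it assumes an exponent of 0.5 for the power transformation. It also shows why some transformations can fail on a column that contains zeros, in which case the helper above drops the column.

```python
import numpy as np
from scipy import stats

# Synthetic, strictly positive sample standing in for a column such as GrLivArea
rng = np.random.default_rng(0)
x = rng.lognormal(mean=7.0, sigma=0.4, size=500)

transformed = {
    "log_e": np.log(x),
    "log_10": np.log10(x),
    "reciprocal": 1.0 / x,
    "power": np.power(x, 0.5),              # assumption: exponent of 0.5
    "box_cox": stats.boxcox(x)[0],          # lambda fitted by maximum likelihood
    "yeo_johnson": stats.yeojohnson(x)[0],  # lambda fitted by maximum likelihood
}

# Compare how each transformation changes the skewness of the sample
for name, values in transformed.items():
    print(f"{name:12s} skew={stats.skew(values):+.3f}")

# Box-Cox requires strictly positive data, Yeo-Johnson does not
with_zero = np.append(x, 0.0)
try:
    stats.boxcox(with_zero)
    box_cox_ok = True
except ValueError:
    box_cox_ok = False
yeo_johnson_ok = bool(np.isfinite(stats.yeojohnson(with_zero)[0]).all())
print(f"Box-Cox with a zero: {box_cox_ok}, Yeo-Johnson with a zero: {yeo_johnson_ok}")
```

The logarithm and the reciprocal are likewise undefined at zero, so a column that includes zeros is a natural candidate for Yeo Johnson or the power transformation.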

## Categorical Encoding

Before we proceed with a deeper analysis, we first have to:
1. Get the list of categorical features
2. Get the list of numerical features

A quick peek at the categorical features

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

def diagnostic_plots_categories(df):
    """
    Creates and displays diagnostic plots for all categorical features in the DataFrame.
    
    Parameters:
        df (pd.DataFrame): The DataFrame containing the data.
    
    Returns:
        None
    """
    # Detect categorical columns in the DataFrame
    categorical_columns = df.select_dtypes(include=['object', 'category']).columns.tolist()

    for col in categorical_columns:
        # Determine the number of unique categories in the column
        num_categories = df[col].nunique()

        # Generate a palette with different colors for each category
        palette = sns.color_palette("husl", num_categories)

        plt.figure(figsize=(10, 6))  # Set up the plotting area
        sns.countplot(
            data=df,
            x=col,
            palette=palette,  # Use the generated palette
            order=df[col].value_counts().index
        )
        plt.xticks(rotation=90)  # Rotate x-axis labels for better readability
        plt.title(f"{col}", fontsize=20, y=1.05)  # Add a title with increased font size
        plt.show()  # Display the plot
        print("\n")  # Print a newline for spacing in console output

diagnostic_plots_categories(df_train)


In [None]:
# Getting list of categorical features

categorical_features = df_train.select_dtypes(include=['object', 'category']).columns.tolist()
categorical_features

In [None]:
# Getting list of numerical features
numerical_features = df_train.select_dtypes(include='number').columns.tolist()
numerical_features

#### Encoding Categorical Features and exploring distribution

In [None]:
df_train_categorical_encoded = feat_engineering_categorical_encoder(df_train[categorical_features])
df_train_categorical_encoded
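
As a rough picture of what an arbitrary ordinal encoding produces, the sketch below maps each category to an integer in order of first appearance using plain pandas. The sample values are hypothetical, and `pd.factorize` is used here as an analogue only; the exact integers assigned by the encoder in the notebook may differ.

```python
import pandas as pd

# Hypothetical sample of a quality-style categorical column
kitchen_qual = pd.Series(["Gd", "TA", "Gd", "Ex"], name="KitchenQual")

# factorize assigns integer codes in order of first appearance
codes, categories = pd.factorize(kitchen_qual)

# Build the category -> integer mapping and the encoded column
mapping = {category: code for code, category in enumerate(categories)}
encoded = pd.Series(codes, name="KitchenQual")

print(mapping)
print(encoded.tolist())
```

Because the integers are assigned arbitrarily, their numeric order does not reflect any quality ranking of the categories.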

Let's take a look at how the categorical features are distributed after encoding

In [None]:
import scipy.stats as stats

def diagnostic_plots_numerical(df):
    """
    Creates and displays diagnostic plots for all numerical features in the DataFrame,
    and annotates skewness, kurtosis, and mean inside the plots.
    
    Parameters:
        df (pd.DataFrame): The DataFrame containing the data.
    
    Returns:
        None
    """
    # Detect numerical columns in the DataFrame
    numerical_columns = df.select_dtypes(include=['number']).columns.tolist()

    for variable in numerical_columns:
        # Calculate statistics
        mean_val = df[variable].mean()
        skew_val = df[variable].skew()
        kurt_val = df[variable].kurtosis()

        # Set up the plotting area with three subplots
        fig, axes = plt.subplots(1, 3, figsize=(18, 6))  # Increased figure size for better readability

        # Histogram with KDE
        sns.histplot(data=df, x=variable, kde=True, element="step", ax=axes[0], color='#1f77b4')
        axes[0].set_title('Histogram', fontsize=15)
        axes[0].set_xlabel(variable, fontsize=12)
        axes[0].set_ylabel('Frequency', fontsize=12)
        # Add text annotation
        axes[0].text(0.95, 0.95, f'Mean: {mean_val:.2f}\nSkew: {skew_val:.2f}\nKurtosis: {kurt_val:.2f}',
                     verticalalignment='top', horizontalalignment='right',
                     transform=axes[0].transAxes, fontsize=12, bbox=dict(facecolor='white', alpha=0.8))

        # QQ Plot
        stats.probplot(df[variable], dist="norm", plot=axes[1])
        axes[1].get_lines()[1].set_color('#ff7f0e')  # Change the color of the QQ line
        axes[1].set_title('QQ Plot', fontsize=15)
        axes[1].set_xlabel('Theoretical Quantiles', fontsize=12)
        axes[1].set_ylabel('Sample Quantiles', fontsize=12)
        

        # Boxplot
        sns.boxplot(x=df[variable], ax=axes[2], color='#2ca02c')
        axes[2].set_title('Boxplot', fontsize=15)
        axes[2].set_xlabel(variable, fontsize=12)
        

        # Overall title for the figure
        fig.suptitle(f"Diagnostic Plots for {variable}", fontsize=20, y=1.05)

        # Adjust layout for better spacing
        plt.tight_layout()
        plt.subplots_adjust(top=0.85)  # Adjust top spacing to make room for the main title

        # Display the plots
        plt.show()
        print("\n")  # Print a newline for spacing in console output

diagnostic_plots_numerical(df_train_categorical_encoded)

We do not notice anything extraordinary; everything looks fairly normal.

#### Inspecting dataframe Numerical Features

Transformations for numerical Features


In [None]:
import pandas as pd
from feature_engine import transformation as vt

def feat_engineering_numerical(df_feat_eng):
    """
    Applies various numerical transformations to all numerical columns in the DataFrame.
    
    Parameters:
        df_feat_eng (pd.DataFrame): The DataFrame to transform.
    
    Returns:
        pd.DataFrame: The DataFrame with original and transformed numerical columns.
    """
    # Create a deep copy of the DataFrame to avoid SettingWithCopyWarning
    df_feat_eng_copy = df_feat_eng.copy()

    # Detect numerical columns in the DataFrame
    numerical_columns = df_feat_eng_copy.select_dtypes(include='number').columns.tolist()

    # Define transformations and their corresponding column suffixes
    transformations = {
        "log_e": vt.LogTransformer(),
        "log_10": vt.LogTransformer(base='10'),
        "reciprocal": vt.ReciprocalTransformer(),
        "power": vt.PowerTransformer(),
        "box_cox": vt.BoxCoxTransformer(),
        "yeo_johnson": vt.YeoJohnsonTransformer()
    }

    # Iterate over each numerical column and apply each transformation
    for column in numerical_columns:
        for suffix, transformer in transformations.items():
            new_column_name = f"{column}_{suffix}"
            transformer.variables = [column]  # Set the variables attribute dynamically
            try:
                # Apply transformation and assign to new column in the copy DataFrame
                df_feat_eng_copy[new_column_name] = transformer.fit_transform(df_feat_eng_copy[[column]])[column]
            except Exception as e:
                # Print error message with details if transformation fails
                print(f"Error applying {transformer.__class__.__name__} to {new_column_name}: {e}")

    return df_feat_eng_copy

In [None]:
df_train_numerical_transformed = feat_engineering_numerical(df_train[numerical_features])
df_train_numerical_transformed

In [None]:
diagnostic_plots_numerical(df_train_numerical_transformed)

### Based on the plots above, we can select the best transformations for each feature:

1. OverallQual: Log-e, Power, Original Values
2. GrLivArea: Log-e, Box Cox, Yeo Johnson, Original Values
3. 1stFlrSF: Log-e, Box Cox, Yeo Johnson, Power, Original Values
4. TotalBsmtSF: Power, Yeo Johnson
5. GarageArea: Power, Original Values
6. YearBuilt: Log-e, Power, Box Cox
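
A visual shortlist like the one above can be cross-checked numerically by ranking candidate columns by absolute skewness. The following is a minimal sketch on synthetic data; the column names are hypothetical and only mimic the `<feature>_<suffix>` naming used in this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed feature; not taken from the housing dataset
rng = np.random.default_rng(42)
base = pd.Series(rng.lognormal(mean=7.0, sigma=0.5, size=1000), name="feature")

# Candidate versions of the same feature
candidates = pd.DataFrame({
    "feature": base,
    "feature_log_e": np.log(base),
    "feature_power": np.sqrt(base),
})

# Summarise shape statistics and rank by how close the skew is to zero
summary = pd.DataFrame({
    "skew": candidates.skew(),
    "kurtosis": candidates.kurtosis(),
})
summary["abs_skew"] = summary["skew"].abs()
summary = summary.sort_values("abs_skew")

best = summary.index[0]
print(summary)
print(f"Lowest absolute skew: {best}")
```

Since the synthetic sample is drawn from a log-normal distribution, the log-transformed column should rank first here; on real data the ranking is only one input alongside the plots.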

### Selection of Transformations and further actions

To make the decision easier, we will use ydata-profiling to see all the information we need.
For that, we make a list of the features we want to explore.

In [None]:
selected_items = ['OverallQual', 'OverallQual_log_e', 'OverallQual_power', 'GrLivArea', 'GrLivArea_log_e', 'GrLivArea_box_cox', 'GrLivArea_yeo_johnson', '1stFlrSF', '1stFlrSF_log_e', '1stFlrSF_power', '1stFlrSF_box_cox', '1stFlrSF_yeo_johnson', 'TotalBsmtSF_power', 'TotalBsmtSF_yeo_johnson', 'GarageArea', 'GarageArea_power', 'YearBuilt_box_cox', 'YearBuilt_log_e', 'YearBuilt_power', 'SalePrice', 'SalePrice_log_e', 'SalePrice_log_10', 'SalePrice_box_cox', 'SalePrice_yeo_johnson']

In [None]:
from ydata_profiling import ProfileReport

hypothesis_1_transformations_profile = ProfileReport(df_train_numerical_transformed[selected_items], minimal=True)
hypothesis_1_transformations_profile.to_notebook_iframe()

### Final selection of transformation for features:

First, let's check whether our target needs a transformation:

Features:
1. OverallQual - Original Values. In the Profile Report, the skew values for the transformed OverallQual differ from those in the plots we observed. At this point I trust the Profile Report more than my own transformation code.
2. GrLivArea - Box Cox. Skewness = 0.0007368103, kurtosis = 0.12902342; the overall distribution looks good, with no extreme values.
3. 1stFlrSF - Yeo Johnson. Skewness is just 0.00060557042, while kurtosis is less encouraging at -0.11443746; no extreme values noticed.
4. TotalBsmtSF - Yeo Johnson. Skewness -0.028003628, kurtosis 1.7998654. The range of values is fairly wide, from 0 and 49.42 up to 752.46, so it might need normalization.
5. GarageArea - Original Values. Skewness is just 0.11419629 and kurtosis 0.8197573; we also noticed extreme values, ranging from 0 to 1390.
6. YearBuilt - Box Cox. Skewness -0.13552187, and kurtosis is still negative at -1.2199092. The transformed values are very large, on the order of 1e69 for the lowest and 1e70 for the highest, so we will simply divide them by 1e69.
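
Dividing by a positive constant such as 1e69 changes only the magnitude of the values; skewness and kurtosis are scale invariant, so the shape statistics quoted above are unaffected. A minimal sketch on synthetic values in a similar range (not the actual transformed YearBuilt column):

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for a transformed column with values between 1e69 and 1e70
rng = np.random.default_rng(1)
huge = rng.uniform(1.0, 10.0, size=500) * 1e69

# Rescale by a constant to bring the values into a convenient range
rescaled = huge / 1e69

skew_before, skew_after = stats.skew(huge), stats.skew(rescaled)
kurt_before, kurt_after = stats.kurtosis(huge), stats.kurtosis(rescaled)

print(f"skew:     {skew_before:.6f} -> {skew_after:.6f}")
print(f"kurtosis: {kurt_before:.6f} -> {kurt_after:.6f}")
print(f"range after rescaling: {rescaled.min():.2f} to {rescaled.max():.2f}")
```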

## Outcome

* Ordinal Encoder: ['BsmtFinType1', 'KitchenQual', 'BsmtExposure', 'GarageFinish']
* Numerical transformations:
    * Box Cox: ['GrLivArea', 'YearBuilt', 'SalePrice']
    * Yeo Johnson: ['1stFlrSF', 'TotalBsmtSF']
    * 'YearBuilt' divided by 1e69
* Original Values: ['OverallQual', 'GarageArea']
