# Notebook 06 - Feature Engineering

## Objectives

Engineer Features for:
* Classification
* Regression
* Clustering

## Inputs
* outputs/datasets/cleaned/train.parquet.gzip

## Outcome:

All features, together with the transformations selected for each of them.

## Change working directory
In this section we get the location of the current directory and move one step up to the parent folder, so that the app can access the project folder.

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os

current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("you have set a new current directory")

Confirm new current directory

In [None]:
current_dir = os.getcwd()
current_dir

## Loading Dataset

In [None]:
import pandas as pd

df = pd.read_parquet('outputs/datasets/cleaned/train.parquet.gzip')
df.head()

## Data Exploration
After the fiasco with Hypothesis 1, we need to explore all features.

Issue: there are so many features that checking each one against every transformation by hand will be hard.

Goal:
1. Encode all categorical features with an ordinal encoder.
2. With every feature now numerical, apply all numerical transformations to each of them.
3. Create functionality to:
* Filter the newly transformed features by their original name
* In a loop, keep dropping the newly created features (transformations) with poor parameters:
    * Skewness
    * Kurtosis
    * Check whether the correlation to SalePrice and to other features has changed
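
The filtering step in the goal above can be sketched roughly as follows. This is only an illustration: the helper name `rank_transformations`, the `SUFFIXES` tuple, and the demo data are assumptions made for the example, and ranking by absolute skewness is one possible criterion, not necessarily the one used later in this notebook.

```python
import numpy as np
import pandas as pd

# Assumed to match the suffixes used when naming transformed columns
SUFFIXES = ("log_e", "log_10", "reciprocal", "power", "box_cox", "yeo_johnson")


def rank_transformations(df, original, target="SalePrice", suffixes=SUFFIXES):
    """Compare a feature with its transformed variants, sorted by |skewness|."""
    # Keep the original column plus every transformed variant that exists
    candidates = [original] + [
        f"{original}_{s}" for s in suffixes if f"{original}_{s}" in df.columns
    ]
    rows = [
        {
            "feature": name,
            "skew": df[name].skew(),
            "kurtosis": df[name].kurtosis(),
            "corr_to_target": df[name].corr(df[target]),
        }
        for name in candidates
    ]
    report = pd.DataFrame(rows).set_index("feature")
    # The least skewed candidate comes first
    return report.sort_values("skew", key=lambda s: s.abs())


# Synthetic demo data (hypothetical, not the project dataset)
rng = np.random.default_rng(0)
area = rng.lognormal(mean=7, sigma=0.5, size=500)
demo = pd.DataFrame({
    "GrLivArea": area,
    "GrLivArea_log_e": np.log(area),
    "SalePrice": area * 100 + rng.normal(0, 5000, size=500),
})
report = rank_transformations(demo, "GrLivArea")
print(report)
```

Once the transformed DataFrame exists, the same helper could be called per original feature, and the candidates at the bottom of the report dropped.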

## Feature Engineering

Function for applying the numerical transformations. Plotting functionality has been removed to speed up the process.

In [None]:
import pandas as pd
from feature_engine import transformation as vt

def feat_engineering_numerical(df_feat_eng):
    """
    Applies various numerical transformations to all numerical columns in the DataFrame.
    
    Parameters:
        df_feat_eng (pd.DataFrame): The DataFrame to transform.
    
    Returns:
        pd.DataFrame: The DataFrame with original and transformed numerical columns.
    """
    # Create a deep copy of the DataFrame to avoid SettingWithCopyWarning
    df_feat_eng_copy = df_feat_eng.copy()

    # Detect numerical columns in the DataFrame
    numerical_columns = df_feat_eng_copy.select_dtypes(include='number').columns.tolist()

    # Define transformations and their corresponding column suffixes
    transformations = {
        "log_e": vt.LogTransformer(),
        "log_10": vt.LogTransformer(base='10'),
        "reciprocal": vt.ReciprocalTransformer(),
        "power": vt.PowerTransformer(),
        "box_cox": vt.BoxCoxTransformer(),
        "yeo_johnson": vt.YeoJohnsonTransformer()
    }

    # Iterate over each numerical column and apply each transformation
    for column in numerical_columns:
        for suffix, transformer in transformations.items():
            new_column_name = f"{column}_{suffix}"
            transformer.variables = [column]  # Set the variables attribute dynamically
            try:
                # Apply transformation and assign to new column in the copy DataFrame
                df_feat_eng_copy[new_column_name] = transformer.fit_transform(df_feat_eng_copy[[column]])[column]
            except Exception as e:
                # Print error message with details if transformation fails
                print(f"Error applying {transformer.__class__.__name__} to {new_column_name}: {e}")

    return df_feat_eng_copy
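
Several of these transformations are only defined on part of the number line, which is why the function wraps each call in try/except. The sketch below shows one way to check up front which suffixes can apply to a column. The helper `applicable_transformations` is hypothetical, and the rule for `power` assumes the default exponent of 0.5 (a square root).

```python
import pandas as pd


def applicable_transformations(series):
    """Return the suffixes whose transformation is defined for every value."""
    values = series.dropna()
    strictly_positive = bool((values > 0).all())
    has_zero = bool((values == 0).any())
    non_negative = bool((values >= 0).all())

    applicable = ["yeo_johnson"]  # defined for any real value
    if strictly_positive:
        applicable += ["log_e", "log_10", "box_cox"]  # need values > 0
    if not has_zero:
        applicable.append("reciprocal")  # 1/x is undefined at 0
    if non_negative:
        applicable.append("power")  # square root needs values >= 0
    return applicable


print(applicable_transformations(pd.Series([0, 5, 10])))
```

A column containing zeros, such as an area feature for houses without that area, would therefore be limited to `yeo_johnson` and `power` under these assumptions.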

Get the list of categorical features

In [None]:
categorical_features = df.select_dtypes(['object', 'category']).columns.tolist()
categorical_features

## Transformations

Categorical Encoding

Check that all columns match the original dataset; if there are extra ones, we will drop them.

In [None]:
feat_compare = pd.read_csv('outputs/datasets/collection/HousePricesRecords.csv').columns.tolist()
df = df[feat_compare]

In [None]:
df.columns

This is funny, what is that Unnamed: 0? Let's check.

In [None]:
df['Unnamed: 0']

Looks like a copy of the index. Yeah… CSV, as I mentioned in README.md, is not good for storing data: you either lose data or get too much :D
Let's drop it.

In [None]:
df = df.drop(columns=['Unnamed: 0'])
orig_feat = df.columns.tolist()
orig_feat
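
One common way such a column appears is a CSV written with the default `to_csv()` settings, which store the index as an unnamed first column. Whether HousePricesRecords.csv was produced this way is an assumption; the sketch below only reproduces the effect on a small hypothetical frame.

```python
import io

import pandas as pd

# Tiny made-up frame, used only to reproduce the effect
frame = pd.DataFrame({"LotArea": [8450, 9600], "SalePrice": [208500, 181500]})

# Default to_csv() writes the index as an unnamed first column
buffer = io.StringIO()
frame.to_csv(buffer)
buffer.seek(0)
with_index = pd.read_csv(buffer)

# index=False writes only the real columns
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)
without_index = pd.read_csv(buffer)

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```

Reading an existing file with `pd.read_csv(path, index_col=0)` is another option when the file cannot be rewritten.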

In [None]:
import pandas as pd
import warnings
from feature_engine.encoding import OrdinalEncoder

warnings.filterwarnings('ignore')


def encode_categorical_features(df):
    """
    Encodes only the categorical features in the DataFrame while leaving other features unchanged.
    Assumes that categorical features are of type 'category' or 'object'.
    """
    # Identify categorical columns in the DataFrame
    categorical_cols = df.select_dtypes(include=['category', 'object']).columns.tolist()

    # Apply ordinal encoding only to categorical columns
    if categorical_cols:
        encoder = OrdinalEncoder(encoding_method='arbitrary', variables=categorical_cols)
        df = encoder.fit_transform(df)

    return df


df[categorical_features] = encode_categorical_features(df)[categorical_features]
df.head()
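
To make the effect of arbitrary ordinal encoding concrete, here is a small stand-in using `pd.factorize`, which numbers categories in order of first appearance. It is not the `OrdinalEncoder` itself, and the sample values are made up; the point is only that the integers carry no quality ranking.

```python
import pandas as pd

# Made-up sample values for illustration
kitchen_qual = pd.Series(["Gd", "TA", "Gd", "Ex", "TA"], name="KitchenQual")

# factorize assigns 0, 1, 2, ... in order of first appearance
codes, categories = pd.factorize(kitchen_qual)
mapping = {category: code for code, category in enumerate(categories)}
encoded = pd.Series(codes, name="KitchenQual")

print(mapping)
print(encoded.tolist())
```

Because the codes depend on row order rather than on meaning, "Ex" ends up with the largest code here only because it appears last.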

In [None]:
df_train_numerical_transformed = feat_engineering_numerical(df)
df_train_numerical_transformed.head()

### Transformations Evaluation

As with the previous hypothesis, I will plot all transformations.

In [None]:
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt

def diagnostic_plots_numerical(df):
    """
    Creates and displays diagnostic plots for all numerical features in the DataFrame,
    and annotates skewness, kurtosis, and mean inside the plots.
    
    Parameters:
        df (pd.DataFrame): The DataFrame containing the data.
    
    Returns:
        None
    """
    # Detect numerical columns in the DataFrame
    numerical_columns = df.select_dtypes(include=['number']).columns.tolist()

    for variable in numerical_columns:
        # Calculate statistics
        mean_val = df[variable].mean()
        skew_val = df[variable].skew()
        kurt_val = df[variable].kurtosis()

        # Set up the plotting area with three subplots
        fig, axes = plt.subplots(1, 3, figsize=(18, 6))  # Increased figure size for better readability

        # Histogram with KDE
        sns.histplot(data=df, x=variable, kde=True, element="step", ax=axes[0], color='#1f77b4')
        axes[0].set_title('Histogram', fontsize=15)
        axes[0].set_xlabel(variable, fontsize=12)
        axes[0].set_ylabel('Frequency', fontsize=12)
        # Add text annotation
        axes[0].text(0.95, 0.95, f'Mean: {mean_val:.2f}\nSkew: {skew_val:.2f}\nKurtosis: {kurt_val:.2f}',
                     verticalalignment='top', horizontalalignment='right',
                     transform=axes[0].transAxes, fontsize=12, bbox=dict(facecolor='white', alpha=0.8))

        # QQ Plot
        stats.probplot(df[variable], dist="norm", plot=axes[1])
        axes[1].get_lines()[1].set_color('#ff7f0e')  # Change the color of the QQ line
        axes[1].set_title('QQ Plot', fontsize=15)
        axes[1].set_xlabel('Theoretical Quantiles', fontsize=12)
        axes[1].set_ylabel('Sample Quantiles', fontsize=12)


        # Boxplot
        sns.boxplot(x=df[variable], ax=axes[2], color='#2ca02c')
        axes[2].set_title('Boxplot', fontsize=15)
        axes[2].set_xlabel(variable, fontsize=12)


        # Overall title for the figure
        fig.suptitle(f"Diagnostic Plots for {variable}", fontsize=20, y=1.05)

        # Adjust layout for better spacing
        plt.tight_layout()
        plt.subplots_adjust(top=0.85)  # Adjust top spacing to make room for the main title

        # Display the plots
        plt.show()
        print("\n")  # Print a newline for spacing in console output

diagnostic_plots_numerical(df_train_numerical_transformed)

### Transformations Summary

Based on the plots above, I have selected these transformations and further actions:

1. 1stFlrSF - Yeo Johnson, test with removing outliers
2. 2ndFlrSF - Yeo Johnson
3. BedroomAbvGr - Yeo Johnson, test with removing outliers
4. BsmtExposure - Yeo Johnson
5. BsmtFinSF1 - Power, test with scaling
6. BsmtFinType1 - Yeo Johnson
7. BsmtUnfSF - Power, test with scaling, might need removing outliers
8. EnclosedPorch - Discard!!!
9. GarageArea - test with scaling and removing outliers
10. GarageFinish - Yeo Johnson, has some negative values
11. GarageYrBlt - Log_10
12. GrLivArea - Log_10, needs removing outliers before creating model
13. KitchenQual - Power, has some negative values
14. LotArea - Yeo Johnson, has lots of outliers, test with removing them or discarding feature
15. LotFrontage - Box_Cox, lots of outliers, try removing
16. MasVnrArea - Yeo Johnson, has some negative values
17. OpenPorchSF - Yeo Johnson, has some negative values
18. OverallCond - Box_Cox, test removing outliers
19. OverallQual - Yeo Johnson, test removing outliers
20. TotalBsmtSF - Yeo Johnson, has negative values, test with scaling and removing outliers
21. WoodDeckSF - Discard
22. YearBuilt - Log_10
23. YearRemodAdd - Log_10, no further preprocessing needed
24. SalePrice - Log_10, remove outliers


Let's make a copy of the selected transformations and see how they perform after removing outliers.

In [None]:
df_selected_transformations = df_train_numerical_transformed[
    ['1stFlrSF_yeo_johnson', '2ndFlrSF_yeo_johnson', 'BedroomAbvGr_yeo_johnson', 'BsmtExposure_yeo_johnson',
     'BsmtFinSF1_power', 'BsmtFinType1_yeo_johnson', 'BsmtUnfSF_power', 'GarageArea', 'GarageFinish_yeo_johnson',
     'GarageYrBlt_log_10', 'GrLivArea_log_10', 'KitchenQual_power', 'LotArea_yeo_johnson', 'LotFrontage_box_cox',
     'MasVnrArea_yeo_johnson', 'OpenPorchSF_yeo_johnson', 'OverallCond_box_cox', 'OverallQual_yeo_johnson',
     'TotalBsmtSF_yeo_johnson', 'YearBuilt_log_10', 'YearRemodAdd_log_10', 'SalePrice_log_10']].copy()

In [None]:
from feature_engine.outliers import OutlierTrimmer

outlier_trimmer = OutlierTrimmer(capping_method='iqr', tail='both', fold=1.5,
                                 variables=['GrLivArea_log_10'])
df_trimmed = outlier_trimmer.fit_transform(df_train_numerical_transformed)
diagnostic_plots_numerical(df_trimmed[['GrLivArea_log_10']])

Let's try the Winsorizer and see how it looks.

In [None]:
# Import the necessary module
from feature_engine.outliers import Winsorizer

# Create a Winsorizer instance
winsorizer = Winsorizer(capping_method='iqr', tail='both', fold=1.5, variables=['GrLivArea_log_10'])

# Apply Winsorization
df_winsorized = winsorizer.fit_transform(df_train_numerical_transformed)

# Generate diagnostic plots
diagnostic_plots_numerical(df_winsorized[['GrLivArea_log_10']])


The Winsorizer also looks very good, and at the same time we keep the same number of records.
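
The difference in record counts can be seen on a toy series. The sketch below applies the same IQR rule (fold of 1.5) with plain pandas; it is a simplified stand-in for `OutlierTrimmer` and `Winsorizer`, and the helper `iqr_bounds` and the sample values are assumptions made for illustration.

```python
import pandas as pd


def iqr_bounds(series, fold=1.5):
    """Lower and upper limits using the IQR proximity rule."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - fold * iqr, q3 + fold * iqr


# Made-up values with one obvious outlier
values = pd.Series([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 50.0], name="demo")
lower, upper = iqr_bounds(values)

trimmed = values[values.between(lower, upper)]  # trimming drops outlier rows
capped = values.clip(lower=lower, upper=upper)  # winsorizing caps the values

print(len(values), len(trimmed), len(capped))
print(capped.max())
```

Trimming shortens the dataset, while capping keeps every row and only pulls extreme values back to the limit.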

## Outcome

We will apply the following transformations:

### Yeo Johnson Transformation
- **1stFlrSF**: Test with removing outliers.
- **2ndFlrSF**
- **BedroomAbvGr**: Test with removing outliers.
- **BsmtExposure**
- **BsmtFinType1**
- **GarageFinish**
- **LotArea**: Has lots of outliers, test with removing them or discarding feature.
- **MasVnrArea**
- **OpenPorchSF**
- **OverallQual**: Test removing outliers.
- **TotalBsmtSF**: Test with scaling and removing outliers.

### Log Transformation
- **GarageYrBlt**: Apply Log_10.
- **GrLivArea**: Apply Log_10, needs removing outliers before creating model.
- **YearBuilt**: Apply Log_10.
- **YearRemodAdd**: Apply Log_10.

### Box Cox Transformation
- **LotFrontage**: Has lots of outliers, might need discarding the feature.
- **OverallCond**: Test removing outliers.

### Power Transformation
- **BsmtFinSF1**: Test with scaling.
- **BsmtUnfSF**: Test with scaling, might need removing outliers.
- **KitchenQual**: Has some negative values.

### Discard Features
- **EnclosedPorch**
- **WoodDeckSF**

### Other Tests and Operations
- **GarageArea**: Test with scaling and removing outliers.

