# **Feature Engineering Notebook**

## Objectives

* Engineer features for Regression model

## Inputs

* outputs/datasets/cleaned/TrainSetCleaned.csv
* outputs/datasets/cleaned/TestSetCleaned.csv 


## Outputs

* Generate a list with variables to engineer 

## Conclusions

* Feature Engineering Transformers
  * Ordinal categorical encoding
  * Numerical Variable Transformation - Numerical transformer
  * Outlier Transformer - Winsorizer
  * Smart Correlation Selection


---

# Change working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
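
As a cross-check of the step above, here is a minimal sketch showing that `os.path.dirname()` applied to the current directory gives the same folder as `pathlib.Path(...).parent`. It only reads the working directory and does not change it; the variable names are illustrative and not part of the notebook.

```python
import os
from pathlib import Path

# Read (do not change) the current working directory.
cwd = os.getcwd()

# Two equivalent ways to obtain the parent folder.
parent_via_os = os.path.dirname(cwd)
parent_via_pathlib = str(Path(cwd).parent)

print(parent_via_os)
print(parent_via_os == parent_via_pathlib)
```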

# Load Cleaned Data

Train Set

In [None]:
import pandas as pd
train_set_path = "outputs/datasets/cleaned/TrainSetCleaned.csv"
TrainSet = pd.read_csv(train_set_path)
TrainSet.head(5)

Test Set

In [None]:
test_set_path = 'outputs/datasets/cleaned/TestSetCleaned.csv'
TestSet = pd.read_csv(test_set_path)
TestSet.head(5)

## Data Exploration

We evaluate which potential transformation could be used on the variables

In [None]:
from pandas_profiling import ProfileReport
pandas_report = ProfileReport(df=TrainSet, minimal=True)
pandas_report.to_notebook_iframe()

Investigation from the Pandas Profiling report shows:

* We have 4 categorical variables: BsmtExposure, BsmtFinType1, GarageFinish and KitchenQual. The approach here uses the **Ordinal Encoder** from the Categorical Variable Encoding transformers to convert each class to a number.

* The remaining 20 numerical variables will use **Numerical Variable Transformation** to transform the variable distribution, ideally bringing it close to a normal distribution.

* Most of the variables have distributions with some outliers. We will assess the distributions of the engineered variables using the **Winsorizer** to cap outliers.

## Correlation and PPS Analysis

We perform another Correlation and PPS analysis on the cleaned datasets.

We use custom functions that combine a correlation analysis (Pearson and Spearman) with PPS. Two functions need to be called:

* CalculateCorrAndPPS(): calculates the correlation tables and the PPS table for a dataset, then prints summary statistics (including Q1 and Q3) for the PPS scores.
* DisplayCorrAndPPS(): takes df_corr_pearson, df_corr_spearman, pps_matrix and the visualization thresholds for correlations and PPS (CorrThreshold and PPS_Threshold respectively), and builds heatmaps for PPS and for the Pearson and Spearman correlations.

Note: The custom functions were taken from the Code Institute lesson on Exploratory Data Analysis Tools, "Predictive Power Score Unit 1".

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import ppscore as pps


def heatmap_corr(df, threshold, figsize=(20, 12), font_annot=8):
    if len(df.columns) > 1:
        mask = np.zeros_like(df, dtype=bool)
        mask[np.triu_indices_from(mask)] = True
        mask[abs(df) < threshold] = True

        fig, axes = plt.subplots(figsize=figsize)
        sns.heatmap(df, annot=True, xticklabels=True, yticklabels=True,
                    mask=mask, cmap='viridis', annot_kws={"size": font_annot}, ax=axes,
                    linewidth=0.5
                    )
        axes.set_yticklabels(df.columns, rotation=0)
        plt.ylim(len(df.columns), 0)
        plt.show()


def heatmap_pps(df, threshold, figsize=(20, 12), font_annot=8):
    if len(df.columns) > 1:
        mask = np.zeros_like(df, dtype=bool)
        mask[abs(df) < threshold] = True
        fig, ax = plt.subplots(figsize=figsize)
        ax = sns.heatmap(df, annot=True, xticklabels=True, yticklabels=True,
                         mask=mask, cmap='rocket_r', annot_kws={"size": font_annot},
                         linewidth=0.05, linecolor='grey')
        plt.ylim(len(df.columns), 0)
        plt.show()


def CalculateCorrAndPPS(df):
    df_corr_spearman = df.corr(method="spearman")
    df_corr_pearson = df.corr(method="pearson")

    pps_matrix_raw = pps.matrix(df)
    pps_matrix = pps_matrix_raw.filter(['x', 'y', 'ppscore']).pivot(columns='x', index='y', values='ppscore')

    pps_score_stats = pps_matrix_raw.query("ppscore < 1").filter(['ppscore']).describe().T
    print("PPS threshold - check PPS score IQR to decide threshold for heatmap \n")
    print(pps_score_stats.round(3))

    return df_corr_pearson, df_corr_spearman, pps_matrix


def DisplayCorrAndPPS(df_corr_pearson, df_corr_spearman, pps_matrix, CorrThreshold, PPS_Threshold,
                      figsize=(20, 12), font_annot=8):

    print("\n")
    print("* Analyse how the target variable for your ML models is correlated with other variables (features and target)")
    print("* Analyse multicollinearity, that is, how the features are correlated among themselves")

    print("\n")
    print("*** Heatmap: Spearman Correlation ***")
    print("It evaluates monotonic relationship \n")
    heatmap_corr(df=df_corr_spearman, threshold=CorrThreshold, figsize=figsize, font_annot=font_annot)

    print("\n")
    print("*** Heatmap: Pearson Correlation ***")
    print("It evaluates the linear relationship between two continuous variables \n")
    heatmap_corr(df=df_corr_pearson, threshold=CorrThreshold, figsize=figsize, font_annot=font_annot)

    print("\n")
    print("*** Heatmap: Predictive Power Score (PPS) ***")
    print(f"PPS detects linear or non-linear relationships between two columns.\n"
          f"The score ranges from 0 (no predictive power) to 1 (perfect predictive power) \n")
    heatmap_pps(df=pps_matrix, threshold=PPS_Threshold, figsize=figsize, font_annot=font_annot)
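
To make the difference between the two correlation heatmaps concrete, here is a minimal pure-Python sketch on made-up toy data. It assumes the usual definitions: Pearson measures linear association, and Spearman is the Pearson correlation of the ranks. The helper names are illustrative and are not part of the notebook's functions.

```python
import math


def pearson(x, y):
    """Pearson correlation: strength of the linear relationship."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data (made up): y = x**2 is monotonic but not linear.
x = [1, 2, 3, 4, 5]
y = [v ** 2 for v in x]

print(round(pearson(x, y), 3))
print(round(spearman(x, y), 3))
```

On this toy data Spearman reaches 1 because the relationship is perfectly monotonic, while Pearson stays slightly below 1 because it is not linear.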

Call CalculateCorrAndPPS to calculate the Correlation levels and PPS Scores matrix on the Train set

In [None]:
df_corr_pearson, df_corr_spearman, pps_matrix = CalculateCorrAndPPS(TrainSet)

The PPS score statistics show that the scores are concentrated in the range of 0 to 0.05. We set the correlation threshold to 0.6 (strong correlation) and the PPS threshold to 0.15, which I consider a moderate threshold.

In [None]:
DisplayCorrAndPPS(df_corr_pearson=df_corr_pearson,
                  df_corr_spearman=df_corr_spearman, 
                  pps_matrix=pps_matrix,
                  CorrThreshold=0.6, PPS_Threshold=0.15,
                  figsize=(10,10), font_annot=8)

Some changes occurred in the correlation levels and the PPS analysis compared with the data cleaning notebook, using the same thresholds:

* Raw Data Analysis - Spearman correlation: suggests that SalePrice has a moderate correlation with GarageArea, TotalBsmtSF and YearBuilt, and strong monotonic relationships with GrLivArea and OverallQual.

* Cleaned Data Analysis: Most of these variables show no major changes, except TotalBsmtSF, which now has 1 as its correlation with SalePrice in the cleaned data.

* Raw Data Analysis - Pearson correlation: The plot shows that OverallQual, GrLivArea, GarageArea, 1stFlrSF and TotalBsmtSF have strong to moderate linear relationships with SalePrice.

* Cleaned Data Analysis: Most of these variables show no major changes, except TotalBsmtSF, whose correlation with SalePrice improved slightly in the cleaned data, although it remains moderate.

  * 1stFlrSF still shows a high correlation level with TotalBsmtSF
  * GarageYrBlt still shows a high correlation level with YearBuilt
  * YearRemodAdd shows a high correlation level with YearBuilt, KitchenQual and GarageYrBlt
  * Correlation levels amongst variables have dropped slightly in the cleaned data

* Raw Data Analysis - PPS Heatmap: A PPS greater than 0.2 usually means strong predictive power. The plot shows that KitchenQual, OverallQual and YearBuilt have strong predictive power for SalePrice, while GarageArea and GarageFinish have some predictive power, but it is weak.

* Cleaned Data Analysis - PPS Heatmap: KitchenQual, OverallQual and YearBuilt remain the same, while GarageArea has gained somewhat stronger predictive power.
  * GarageFinish has lost its weak predictive power. GarageYrBlt now has some predictive power, but it is weak.

---

# Feature Engineering

We use the custom function from the Code Institute feature-engine lesson material and the Churnometer walkthrough project in the feature engineering process.

In [None]:
%matplotlib inline
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import warnings
from feature_engine import transformation as vt
from feature_engine.outliers import Winsorizer
from feature_engine.encoding import OrdinalEncoder
sns.set(style="whitegrid")
warnings.filterwarnings('ignore')


def FeatureEngineeringAnalysis(df, analysis_type=None):
    """
    - used for quick feature engineering on numerical and categorical variables
    to decide which transformation can better transform the distribution shape
    - Once transformed, use a reporting tool, like pandas-profiling, to evaluate distributions
    """
    check_missing_values(df)
    allowed_types = ['numerical', 'ordinal_encoder', 'outlier_winsorizer']
    check_user_entry_on_analysis_type(analysis_type, allowed_types)
    list_column_transformers = define_list_column_transformers(analysis_type)

    # Loop in each variable and engineer the data according to the analysis type
    df_feat_eng = pd.DataFrame([])
    for column in df.columns:
        # create additional columns (column_method) to apply the methods
        df_feat_eng = pd.concat([df_feat_eng, df[column]], axis=1)
        for method in list_column_transformers:
            df_feat_eng[f"{column}_{method}"] = df[column]

        # Apply transformers in respective column_transformers
        df_feat_eng, list_applied_transformers = apply_transformers(
            analysis_type, df_feat_eng, column)

        # For each variable, assess how the transformations perform
        transformer_evaluation(
            column, list_applied_transformers, analysis_type, df_feat_eng)

    return df_feat_eng


def check_user_entry_on_analysis_type(analysis_type, allowed_types):
    """ Check analysis type """
    if analysis_type is None:
        raise SystemExit(
            f"You should pass analysis_type parameter as one of the following options: {allowed_types}")
    if analysis_type not in allowed_types:
        raise SystemExit(
            f"analysis_type argument should be one of these options: {allowed_types}")


def check_missing_values(df):
    if df.isna().sum().sum() != 0:
        raise SystemExit(
            f"There is a missing value in your dataset. Please handle that before getting into feature engineering.")


def define_list_column_transformers(analysis_type):
    """ Set suffix columns according to analysis_type"""
    if analysis_type == 'numerical':
        list_column_transformers = [
            "log_e", "log_10", "reciprocal", "power", "box_cox", "yeo_johnson"]

    elif analysis_type == 'ordinal_encoder':
        list_column_transformers = ["ordinal_encoder"]

    elif analysis_type == 'outlier_winsorizer':
        list_column_transformers = ['iqr']

    return list_column_transformers


def apply_transformers(analysis_type, df_feat_eng, column):
    for col in df_feat_eng.select_dtypes(include='category').columns:
        df_feat_eng[col] = df_feat_eng[col].astype('object')

    if analysis_type == 'numerical':
        df_feat_eng, list_applied_transformers = FeatEngineering_Numerical(
            df_feat_eng, column)

    elif analysis_type == 'outlier_winsorizer':
        df_feat_eng, list_applied_transformers = FeatEngineering_OutlierWinsorizer(
            df_feat_eng, column)

    elif analysis_type == 'ordinal_encoder':
        df_feat_eng, list_applied_transformers = FeatEngineering_CategoricalEncoder(
            df_feat_eng, column)

    return df_feat_eng, list_applied_transformers


def transformer_evaluation(column, list_applied_transformers, analysis_type, df_feat_eng):
    # For each variable, assess how the transformations perform
    print(f"* Variable Analyzed: {column}")
    print(f"* Applied transformation: {list_applied_transformers} \n")
    for col in [column] + list_applied_transformers:

        if analysis_type != 'ordinal_encoder':
            DiagnosticPlots_Numerical(df_feat_eng, col)

        else:
            if col == column:
                DiagnosticPlots_Categories(df_feat_eng, col)
            else:
                DiagnosticPlots_Numerical(df_feat_eng, col)

        print("\n")


def DiagnosticPlots_Categories(df_feat_eng, col):
    plt.figure(figsize=(4, 3))
    sns.countplot(data=df_feat_eng, x=col, palette=[
                  '#432371'], order=df_feat_eng[col].value_counts().index)
    plt.xticks(rotation=90)
    plt.suptitle(f"{col}", fontsize=30, y=1.05)
    plt.show()
    print("\n")


def DiagnosticPlots_Numerical(df, variable):
    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    sns.histplot(data=df, x=variable, kde=True, element="step", ax=axes[0])
    stats.probplot(df[variable], dist="norm", plot=axes[1])
    sns.boxplot(x=df[variable], ax=axes[2])

    axes[0].set_title('Histogram')
    axes[1].set_title('QQ Plot')
    axes[2].set_title('Boxplot')
    fig.suptitle(f"{variable}", fontsize=30, y=1.05)
    plt.tight_layout()
    plt.show()


def FeatEngineering_CategoricalEncoder(df_feat_eng, column):
    list_methods_worked = []
    try:
        encoder = OrdinalEncoder(encoding_method='arbitrary', variables=[
                                 f"{column}_ordinal_encoder"])
        df_feat_eng = encoder.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_ordinal_encoder")

    except Exception:
        df_feat_eng.drop([f"{column}_ordinal_encoder"], axis=1, inplace=True)

    return df_feat_eng, list_methods_worked


def FeatEngineering_OutlierWinsorizer(df_feat_eng, column):
    list_methods_worked = []

    # Winsorizer iqr
    try:
        disc = Winsorizer(
            capping_method='iqr', tail='both', fold=1.5, variables=[f"{column}_iqr"])
        df_feat_eng = disc.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_iqr")
    except Exception:
        df_feat_eng.drop([f"{column}_iqr"], axis=1, inplace=True)

    return df_feat_eng, list_methods_worked


def FeatEngineering_Numerical(df_feat_eng, column):
    list_methods_worked = []

    # LogTransformer base e
    try:
        lt = vt.LogTransformer(variables=[f"{column}_log_e"])
        df_feat_eng = lt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_log_e")
    except Exception:
        df_feat_eng.drop([f"{column}_log_e"], axis=1, inplace=True)

    # LogTransformer base 10
    try:
        lt = vt.LogTransformer(variables=[f"{column}_log_10"], base='10')
        df_feat_eng = lt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_log_10")
    except Exception:
        df_feat_eng.drop([f"{column}_log_10"], axis=1, inplace=True)

    # ReciprocalTransformer
    try:
        rt = vt.ReciprocalTransformer(variables=[f"{column}_reciprocal"])
        df_feat_eng = rt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_reciprocal")
    except Exception:
        df_feat_eng.drop([f"{column}_reciprocal"], axis=1, inplace=True)

    # PowerTransformer
    try:
        pt = vt.PowerTransformer(variables=[f"{column}_power"])
        df_feat_eng = pt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_power")
    except Exception:
        df_feat_eng.drop([f"{column}_power"], axis=1, inplace=True)

    # BoxCoxTransformer
    try:
        bct = vt.BoxCoxTransformer(variables=[f"{column}_box_cox"])
        df_feat_eng = bct.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_box_cox")
    except Exception:
        df_feat_eng.drop([f"{column}_box_cox"], axis=1, inplace=True)

    # YeoJohnsonTransformer
    try:
        yjt = vt.YeoJohnsonTransformer(variables=[f"{column}_yeo_johnson"])
        df_feat_eng = yjt.fit_transform(df_feat_eng)
        list_methods_worked.append(f"{column}_yeo_johnson")
    except Exception:
        df_feat_eng.drop([f"{column}_yeo_johnson"], axis=1, inplace=True)
    
    return df_feat_eng, list_methods_worked


## Feature Engineering Summary

The following is the summary of the potential feature engineering transformers considered:

* Categorical Variable Encoding - Ordinal Encoder
* Numerical Variable Transformation - Numerical transformer
* Outlier - Winsorizer
* Smart Correlation Selection

### Categorical Encoding - Ordinal: replaces categories with ordinal numbers

Select the categorical variables

In [None]:
variables_engineering = ['BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual']
variables_engineering

Create a DataFrame with the variables

In [None]:
df_engineering = TrainSet[variables_engineering].copy()
df_engineering.head(3)

Create the engineered variables and apply the transformations, assess the engineered variables distribution and select the most suitable method for each variable.

In [None]:
df_engineering = FeatureEngineeringAnalysis(df=df_engineering, analysis_type='ordinal_encoder')

The transformation is effective: it converted all the categorical classes to numbers, although the shape of the distribution remains the same.

We apply the selected transformation to the Train and Test set

In [None]:
encoder = OrdinalEncoder(encoding_method='arbitrary', variables = variables_engineering)
TrainSet = encoder.fit_transform(TrainSet)
TestSet = encoder.transform(TestSet)
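
The fit-on-train, transform-on-test pattern above can be sketched in plain Python. This is a simplified illustration on made-up labels, not the encoder's implementation: numbering categories by first appearance is an assumption of this sketch, and the function names are hypothetical.

```python
def fit_ordinal_mapping(values):
    """Learn category -> integer codes, numbered by first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return mapping


def apply_mapping(values, mapping):
    """Encode values with a mapping learned earlier (KeyError if unseen)."""
    return [mapping[value] for value in values]


# Made-up category labels, for illustration only.
train_column = ["Gd", "TA", "Gd", "Ex"]
test_column = ["TA", "Ex"]

mapping = fit_ordinal_mapping(train_column)         # learned on train only
train_encoded = apply_mapping(train_column, mapping)
test_encoded = apply_mapping(test_column, mapping)  # reused on test

print(mapping)
print(train_encoded, test_encoded)
```

Learning the mapping on the train set only and reusing it on the test set keeps the two sets consistently encoded.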

Check the transformation for TrainSet and TestSet application

In [None]:
TrainSet.filter(variables_engineering).head(5)

In [None]:
TestSet.filter(variables_engineering).head(5)

### Numerical Transformation

Select the Numerical variables except the target variable

In [None]:
variables_engineering = (TrainSet.select_dtypes(include=['float', 'int']).columns.to_list())
variables_engineering.remove('SalePrice')
variables_engineering

Create a DataFrame with the variables

In [None]:
df_engineering = TrainSet[variables_engineering].copy()
df_engineering.head(5)

Create the engineered variables and apply the transformations, assess the engineered variables distribution and select the most suitable method for each variable.

In [None]:
df_engineering = FeatureEngineeringAnalysis(df=df_engineering, analysis_type='numerical')

Variable transformation distribution analysis: we assessed each variable's distribution to select the most suitable method.

Variables:

1. **1stFlrSF:** It was possible to apply all the numerical transformers. They all show similar results in normalizing the data.

**Conclusion** - log_e was selected for this transformation.

2. **2ndFlrSF:** Only power and Yeo-Johnson were applied to this variable, and neither normalized the data, as the variable consists mostly of zeros.

**Conclusion** - No transformer selected

3. **BedroomAbvGr:** No improvement seen in the distribution from any of the transformed variables.

**Conclusion** - No transformer selected

4. **BsmtExposure:** No improvement seen in the distribution from any of the transformed variables.

**Conclusion** - No transformer selected

5. **BsmtFinSF1:** Only power and Yeo-Johnson were applied to this variable, with a slight improvement in the data distribution. However, the boxplot shows only a lower boundary and no outliers, indicating the loss of some extreme values.

**Conclusion** - Power was selected for this transformation.

6. **BsmtFinType1:** No improvement seen in the distribution from any of the transformed variables.

**Conclusion** - No transformer selected

7. **BsmtUnfSF:** Both power and Yeo-Johnson made a slight improvement to the data distribution. No outliers were present; neither transformer normalized the distribution, but both helped reduce its abnormality.

**Conclusion** - Power was selected for this transformation.

8. **EnclosedPorch:** Only power and Yeo-Johnson were applied to this variable. No improvement seen in the distribution from the transformation.

**Conclusion** - No transformer selected

9. **GarageArea:** Only power and Yeo-Johnson were applied to this variable. Both made a slight improvement to the data distribution.

**Conclusion** - Power was selected for this transformation.

10. **GarageFinish:** Only power and Yeo-Johnson were applied to this variable. No improvement seen in the distribution from the transformation.

**Conclusion** - No transformer selected

11. **GarageYrBlt:** It was possible to apply all the numerical transformers. A slight improvement to the data distribution was seen with 'yeo_johnson' and 'box_cox'.

**Conclusion** - yeo_johnson was selected for this transformation.

12. **GrLivArea:** It was possible to apply all the numerical transformers. Some improvement was seen in the distribution and in the fit to the QQ plot diagonal line for every transformer except the reciprocal transformer.

**Conclusion** - log_e was selected for this transformation.

13. **KitchenQual:** Only power and Yeo-Johnson were applied to this variable. No improvement seen in the distribution from any of the transformed variables.

**Conclusion** - No transformer selected

14. **LotArea:** It was possible to apply all the numerical transformers. Some improvement was seen in the distribution and in the fit to the QQ plot diagonal line for every transformer except the reciprocal transformer.

**Conclusion** - log_e was selected for this transformation.

15. **LotFrontage:** It was possible to apply all the numerical transformers. Some improvement was seen in the distribution. They all show similar results in normalizing the data.

**Conclusion** - yeo_johnson was selected for this transformation.

16. **MasVnrArea:** Only power and Yeo-Johnson were applied to this variable. No improvement seen in the distribution from any of the transformed variables.

**Conclusion** - No transformer selected

17. **OpenPorchSF:** Only power and Yeo-Johnson were applied to this variable. Both transformations gave a slight improvement in the data distribution. However, the boxplot shows only a lower boundary and no outliers, indicating the loss of some extreme values.

**Conclusion** - yeo_johnson was selected for this transformation.

18. **OverallCond:** It was possible to apply all the numerical transformers. No improvement seen in the distribution from any of the transformed variables.

**Conclusion** - No transformer selected

19. **OverallQual:** It was possible to apply all the numerical transformers. No improvement seen in the distribution from any of the transformed variables.

**Conclusion** - No transformer selected

20. **TotalBsmtSF:** Only power and Yeo-Johnson were applied to this variable. Some improvement was seen in the distribution. Both show similar results in normalizing the data.

**Conclusion** - yeo_johnson was selected for this transformation.

21. **WoodDeckSF:** Only power and Yeo-Johnson were applied to this variable. No improvement seen in the distribution from any of the transformed variables.

**Conclusion** - No transformer selected

22. **YearBuilt:** It was possible to apply all the numerical transformers. Both box_cox and Yeo-Johnson made a slight improvement to the data distribution. No outliers were present; neither transformer normalized the distribution, but both helped reduce its abnormality.

**Conclusion** - yeo_johnson was selected for this transformation.

23. **YearRemodAdd:** It was possible to apply all the numerical transformers. No improvement seen in the distribution from any of the transformed variables.

**Conclusion** - No transformer selected
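
The recurring pattern in the list above, where only power and Yeo-Johnson could be applied, is consistent with those variables containing zeros (as noted for 2ndFlrSF). The following minimal sketch uses only Python's `math` module and made-up values to show which simple transforms are defined for every value in a column. The function name and the candidate set are illustrative, and the square root stands in for a power transform with exponent 0.5, which is an assumption of this sketch.

```python
import math


def applicable_transforms(values):
    """Return the names of the transforms defined for every value."""
    candidates = {
        "log_e": math.log,                 # undefined for values <= 0
        "log_10": math.log10,              # undefined for values <= 0
        "reciprocal": lambda v: 1 / v,     # undefined at 0
        "power_0.5": math.sqrt,            # defined for values >= 0
    }
    applicable = []
    for name, func in candidates.items():
        try:
            for v in values:
                func(v)
            applicable.append(name)
        except (ValueError, ZeroDivisionError):
            # At least one value is outside the transform's domain.
            pass
    return applicable


# Made-up columns: one containing a zero, one strictly positive.
print(applicable_transforms([0, 120, 854]))
print(applicable_transforms([856, 1262]))
```

Box-Cox also requires strictly positive values, while Yeo-Johnson is defined for zero and negative values as well, which is why it remains available for zero-heavy variables.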

## Selected Transformers

### Log Transformer (base e)

In [None]:
# Selected variables for LogTransformer
vars_engineering_log = ['1stFlrSF', 'GrLivArea', 'LotArea']
vars_engineering_log

Apply the selected transformation to the Train and Test set

In [None]:
transformed_var = vt.LogTransformer(variables=vars_engineering_log)
TrainSet = transformed_var.fit_transform(TrainSet)
TestSet = transformed_var.transform(TestSet)

### Power Transformer

In [None]:
# Selected variables for PowerTransformer
vars_engineering_power = ['BsmtFinSF1', 'BsmtUnfSF', 'GarageArea']
vars_engineering_power

Apply the selected transformation to the Train and Test set

In [None]:
transformed_var = vt.PowerTransformer(variables=vars_engineering_power)
TrainSet = transformed_var.fit_transform(TrainSet)
TestSet = transformed_var.transform(TestSet)

### Yeo Johnson Transformer

In [None]:
# Selected variables for Yeo-JohnsonTransformer
vars_engineering_yj = ['GarageYrBlt', 'LotFrontage', 'OpenPorchSF', 'TotalBsmtSF', 'YearBuilt']
vars_engineering_yj

Apply the selected transformation to the Train and Test set

In [None]:
transformed_var = vt.YeoJohnsonTransformer(variables=vars_engineering_yj)
TrainSet = transformed_var.fit_transform(TrainSet)
TestSet = transformed_var.transform(TestSet)
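
For reference, the Yeo-Johnson transform of a single value can be written out in a few lines. This sketch takes lambda as a given argument; the value 0.5 used below is arbitrary and chosen only for illustration, whereas the transformer above fits a lambda per variable on the train set.

```python
import math


def yeo_johnson(x, lmbda):
    """Yeo-Johnson transform of a single value for a given lambda."""
    if x >= 0:
        if lmbda == 0:
            return math.log(x + 1)
        return ((x + 1) ** lmbda - 1) / lmbda
    if lmbda == 2:
        return -math.log(-x + 1)
    return -((-x + 1) ** (2 - lmbda) - 1) / (2 - lmbda)


# lambda = 0.5 is an arbitrary illustrative value, not a fitted one.
for value in [0, 100, 400]:
    print(value, round(yeo_johnson(value, 0.5), 3))
```

Unlike the log transform, the function is defined at zero, where it returns zero for any lambda.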

## Outlier Transformation

We apply the transformer to all numerical variables except the target

In [None]:
variables_engineering = (TrainSet.select_dtypes(include=['float', 'int']).columns.to_list())
variables_engineering.remove('SalePrice')
variables_engineering

Create a DataFrame with the variables

In [None]:
df_engineering = TrainSet[variables_engineering].copy()
df_engineering.head(5)

Create the engineered variables and apply the transformations, assess the engineered variables distribution and select the most suitable method for each variable.

In [None]:
df_engineering = FeatureEngineeringAnalysis(df=df_engineering, analysis_type='outlier_winsorizer')

Variable transformation distribution analysis - Assess variables distribution to select the most suitable method

Variables Selected: 

In [None]:
# Selected variables for Outlier Transformer
vars_engineering_outlier = ['1stFlrSF', 'GrLivArea', 'LotArea', 'LotFrontage', 'TotalBsmtSF']
vars_engineering_outlier

Apply the selected transformation to the Train and Test set

In [None]:
transformed_var = Winsorizer(
        capping_method='iqr', tail='both', fold=1.5, variables=vars_engineering_outlier)

TrainSet = transformed_var.fit_transform(TrainSet)
TestSet = transformed_var.transform(TestSet)
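
The IQR capping rule with `fold=1.5` can be sketched with the standard library on made-up values. The limits are Q1 - 1.5 * IQR and Q3 + 1.5 * IQR, and anything outside them is pulled back to the nearest limit. The quartile interpolation method used here (`method="inclusive"`) is an assumption of this sketch, and the function name is illustrative.

```python
import statistics


def iqr_cap(values, fold=1.5):
    """Cap values outside [Q1 - fold*IQR, Q3 + fold*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - fold * iqr, q3 + fold * iqr
    capped = [min(max(v, lower), upper) for v in values]
    return capped, lower, upper


# Made-up values with one extreme point.
values = [1, 2, 3, 4, 5, 6, 7, 8, 100]
capped, lower, upper = iqr_cap(values)

print(lower, upper)
print(capped)
```

In the notebook the limits are learned from the train set during fitting and then reused unchanged on the test set.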

## Smart Correlated Selection Variables

Finds groups of correlated features and then selects, from each group, the feature to keep according to the selection method.

For this transformer, we pass all numerical variables except the target variable. The aim is to leave the target variable untouched, as it is the dependent variable we are trying to predict.

In [None]:
variables_engineering = (TrainSet.select_dtypes(include=['float', 'int']).columns.to_list())
variables_engineering.remove('SalePrice')
variables_engineering

In [None]:
df_engineering = TrainSet[variables_engineering].copy()
df_engineering.head(5)

We set the method as Spearman, the threshold as 0.7 and selection_method as variance. A threshold of 0.7 means that any pair of variables whose correlation is above the threshold is treated as correlated and is subject to removal.

In [None]:
from feature_engine.selection import SmartCorrelatedSelection
corr_sel = SmartCorrelatedSelection(variables=None, method="spearman", threshold=0.7, selection_method="variance")

# We can check which sets of features were marked as correlated 
corr_sel.fit_transform(df_engineering)
corr_sel.correlated_feature_sets_

In [None]:
# We check which variables were removed
corr_sel.features_to_drop_
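
The idea behind the selection can be sketched in plain Python. This is a simplified greedy illustration, not the library's implementation: the feature names, correlations and variances are made up, and the function name is hypothetical.

```python
def select_by_variance(corr, variances, threshold=0.7):
    """Group features correlated above the threshold; keep the one with
    the highest variance in each group and mark the rest to drop.

    corr maps frozenset({f1, f2}) to an absolute correlation value.
    """
    features = list(variances)
    assigned = set()
    groups, to_drop = [], []
    for f in features:
        if f in assigned:
            continue
        group = [f] + [
            g for g in features
            if g != f and g not in assigned
            and corr.get(frozenset((f, g)), 0.0) > threshold
        ]
        if len(group) > 1:
            assigned.update(group)
            groups.append(set(group))
            keep = max(group, key=variances.get)
            to_drop.extend(g for g in group if g != keep)
    return groups, to_drop


# Toy inputs (made up): pairs not listed are treated as uncorrelated.
variances = {"a": 4.0, "b": 9.0, "c": 1.0, "d": 2.0}
corr = {
    frozenset(("a", "b")): 0.85,
    frozenset(("a", "c")): 0.20,
    frozenset(("c", "d")): 0.75,
}

groups, to_drop = select_by_variance(corr, variances)
print(groups)
print(to_drop)
```

Here "a" and "b" form one group and "c" and "d" another; the lower-variance member of each group is the one marked for removal.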

---

## **Conclusions and Next steps:** 

The list below shows the transformations needed for feature engineering which will be added to the ML Pipeline.

Feature Engineering Transformers

* Ordinal Encoder: 'BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual'

* Numerical Transformers:
  * Log Transformer (base e): '1stFlrSF', 'GrLivArea', 'LotArea'
  * Power Transformer: 'BsmtFinSF1', 'BsmtUnfSF', 'GarageArea'
  * Yeo Johnson Transformer: 'GarageYrBlt', 'LotFrontage', 'OpenPorchSF', 'TotalBsmtSF', 'YearBuilt'
  
* Outlier Transformer:
   * Winsorizer: '1stFlrSF', 'GrLivArea', 'LotArea', 'LotFrontage', 'TotalBsmtSF'

* Smart Correlated Selection: '1stFlrSF', 'GarageYrBlt', 'TotalBsmtSF', 'YearBuilt', 'YearRemodAdd'
* Dropped Variables: '1stFlrSF', 'YearBuilt', 'YearRemodAdd'

---

* Clear the outputs, and move on to the following notebook.