# **Feature Engineering**

## Objectives

* Load and inspect the data prepared during data cleaning
* Data exploration
* Feature engineering
* Conclusion and next steps

## Inputs

* outputs/datasets/cleaned/train_set.csv
* outputs/datasets/cleaned/test_set.csv

## Outputs

* A list of the feature engineering transformers and the variables they apply to, summarised in the conclusions

## Additional Comments

* This notebook was written based on the guidelines provided in the Customer Churn walkthrough project, Feature Engineering lesson.
* This notebook relates to the Data Preparation step of the CRISP-DM methodology.

---

# Change working directory

Change the working directory from its current folder to its parent folder
* Access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

Make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

---

# Import Packages & set environment variables

In [None]:
import pandas as pd
pd.options.display.max_columns = None

---

# Load Data

* Load Train Set

In [None]:
train_set_path = "outputs/datasets/cleaned/train_set.csv"
train_set = pd.read_csv(train_set_path)
train_set.head(3)

* Load Test Set

In [None]:
test_set_path = "outputs/datasets/cleaned/test_set.csv"
test_set = pd.read_csv(test_set_path)
test_set.head(3)

---

# Data Exploration

* Run pandas profiling report to evaluate potential transformations in the data

In [None]:
from pandas_profiling import ProfileReport
pandas_report = ProfileReport(df=train_set, minimal=True)
pandas_report.to_notebook_iframe()

---

## Feature Engineering

### Analysis and Transformation Functions

* I use the custom function from the feature-engine lesson to implement the feature engineering process.

In [None]:
from feature_engine import transformation as vt
from feature_engine.outliers import Winsorizer
from feature_engine.encoding import OrdinalEncoder
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid")
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

def FeatureEngineeringAnalysis(df,analysis_type=None):
  """
  - used for quick feature engineering on numerical and categorical variables
  to decide which transformation can better transform the distribution shape 
  - Once transformed, use a reporting tool, like pandas-profiling, to evaluate distributions

  """
  check_missing_values(df)
  allowed_types= ['numerical', 'ordinal_encoder',  'outlier_winsorizer']
  check_user_entry_on_analysis_type(analysis_type, allowed_types)
  list_column_transformers = define_list_column_transformers(analysis_type)
  
  
  # Loop in each variable and engineer the data according to the analysis type
  df_feat_eng = pd.DataFrame([])
  for column in df.columns:
    # create additional columns (column_method) to apply the methods
    df_feat_eng = pd.concat([df_feat_eng, df[column]], axis=1)
    for method in list_column_transformers:
      df_feat_eng[f"{column}_{method}"] = df[column]
      
    # Apply the transformers to their respective column_method columns
    df_feat_eng,list_applied_transformers = apply_transformers(analysis_type, df_feat_eng, column)

    # For each variable, assess how the transformations perform
    transformer_evaluation(column, list_applied_transformers, analysis_type, df_feat_eng)

  return df_feat_eng


def check_user_entry_on_analysis_type(analysis_type, allowed_types):
  ### Check analysis type
  if analysis_type is None:
    raise SystemExit(f"You should pass analysis_type parameter as one of the following options: {allowed_types}")
  if analysis_type not in allowed_types:
    raise SystemExit(f"analysis_type argument should be one of these options: {allowed_types}")


def check_missing_values(df):
  if df.isna().sum().sum() != 0:
    raise SystemExit(
        "There are missing values in your dataset. Please handle them before getting into feature engineering.")


def define_list_column_transformers(analysis_type):
  ### Set suffix columns according to analysis_type
  if analysis_type=='numerical':
    list_column_transformers = ["log_e", "log_10", "reciprocal", "power", "box_cox", "yeo_johnson"]
  
  elif analysis_type=='ordinal_encoder':
    list_column_transformers = ["ordinal_encoder"]

  elif analysis_type=='outlier_winsorizer':
    list_column_transformers = ['iqr']

  return list_column_transformers


def apply_transformers(analysis_type, df_feat_eng, column):


  for col in df_feat_eng.select_dtypes(include='category').columns:
    df_feat_eng[col] = df_feat_eng[col].astype('object')


  if analysis_type=='numerical':
    df_feat_eng,list_applied_transformers = FeatEngineering_Numerical(df_feat_eng,column)
  
  elif analysis_type=='outlier_winsorizer':
    df_feat_eng,list_applied_transformers = FeatEngineering_OutlierWinsorizer(df_feat_eng,column)

  elif analysis_type=='ordinal_encoder':
    df_feat_eng,list_applied_transformers = FeatEngineering_CategoricalEncoder(df_feat_eng,column)

  return df_feat_eng,list_applied_transformers


def transformer_evaluation(column, list_applied_transformers, analysis_type, df_feat_eng):
  # For each variable, assess how the transformations perform
  print(f"* Variable Analyzed: {column}")
  print(f"* Applied transformation: {list_applied_transformers} \n")
  transformer_column = column + '_' + analysis_type


  for col in [column] + list_applied_transformers:
    
    if analysis_type!='ordinal_encoder':
      DiagnosticPlots_Numerical(df_feat_eng, col)
    
    else:
      if col == column: 
        DiagnosticPlots_Categories(df_feat_eng, col)
      else:
        DiagnosticPlots_Numerical(df_feat_eng, col)

    print("\n")


def DiagnosticPlots_Categories(df_feat_eng, col):
  plt.figure(figsize=(4, 3))
  # A single colour is passed with `color`; `palette` is meant for use together with `hue`
  sns.countplot(data=df_feat_eng, x=col, color='#432371', order=df_feat_eng[col].value_counts().index)
  plt.xticks(rotation=90) 
  plt.suptitle(f"{col}", fontsize=30,y=1.05)        
  plt.show()
  print("\n")


def DiagnosticPlots_Numerical(df, variable):
  fig, axes = plt.subplots(1, 3, figsize=(12, 4))
  sns.histplot(data=df, x=variable, kde=True,element="step",ax=axes[0]) 
  stats.probplot(df[variable], dist="norm", plot=axes[1])
  sns.boxplot(x=df[variable],ax=axes[2])
  
  axes[0].set_title('Histogram')
  axes[1].set_title('QQ Plot')
  axes[2].set_title('Boxplot')
  fig.suptitle(f"{variable}", fontsize=30,y=1.05)
  plt.tight_layout()
  plt.show()


def FeatEngineering_CategoricalEncoder(df_feat_eng,column):
  list_methods_worked = []
  try:  
    encoder= OrdinalEncoder(encoding_method='arbitrary', variables = [f"{column}_ordinal_encoder"])
    df_feat_eng = encoder.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_ordinal_encoder")
  
  except Exception:  # encoder not applicable to this variable: drop its column
    df_feat_eng.drop([f"{column}_ordinal_encoder"],axis=1,inplace=True)
    
  return df_feat_eng,list_methods_worked


def FeatEngineering_OutlierWinsorizer(df_feat_eng,column):
  list_methods_worked = []

  ### Winsorizer iqr
  try: 
    disc=Winsorizer(
        capping_method='iqr', tail='both', fold=1.5, variables = [f"{column}_iqr"])
    df_feat_eng = disc.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_iqr")
  except Exception:  # winsorizer not applicable to this variable: drop its column
    df_feat_eng.drop([f"{column}_iqr"],axis=1,inplace=True)


  return df_feat_eng,list_methods_worked


def FeatEngineering_Numerical(df_feat_eng,column):
  list_methods_worked = []

  # One transformer per suffixed copy of the column, applied in this order
  transformers = {
    "log_e": vt.LogTransformer(variables=[f"{column}_log_e"]),
    "log_10": vt.LogTransformer(variables=[f"{column}_log_10"], base='10'),
    "reciprocal": vt.ReciprocalTransformer(variables=[f"{column}_reciprocal"]),
    "power": vt.PowerTransformer(variables=[f"{column}_power"]),
    "box_cox": vt.BoxCoxTransformer(variables=[f"{column}_box_cox"]),
    "yeo_johnson": vt.YeoJohnsonTransformer(variables=[f"{column}_yeo_johnson"]),
  }

  for suffix, transformer in transformers.items():
    try:
      df_feat_eng = transformer.fit_transform(df_feat_eng)
      list_methods_worked.append(f"{column}_{suffix}")
    except Exception:
      # Transformer not applicable to this variable: drop its column
      df_feat_eng.drop([f"{column}_{suffix}"],axis=1,inplace=True)

  return df_feat_eng,list_methods_worked

---
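
The helper above also defines an `outlier_winsorizer` analysis type, which caps outliers using the interquartile range (IQR). For reference, IQR-based capping can be sketched in plain pandas. This is a minimal sketch on made-up values, assuming caps at Q1 - 1.5 * IQR and Q3 + 1.5 * IQR (matching `fold=1.5`); `cap_iqr` is a hypothetical helper and not feature-engine's implementation.

```python
import pandas as pd


def cap_iqr(series, fold=1.5):
    """Cap values outside [Q1 - fold * IQR, Q3 + fold * IQR] (hypothetical helper)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - fold * iqr, upper=q3 + fold * iqr)


# Made-up values with one extreme observation
toy = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 100.0])
capped = cap_iqr(toy)
print(capped.tolist())
```

Only the extreme observation is pulled back to the upper cap; values inside the caps are left unchanged.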

### Transformers to be used
* These are the transformers that will be used, applied in this order:
  * Categorical Encoding
  * Numerical Transformation
  * Smart Correlation Selection

### Categorical Encoding

* Replace categorical data with ordinal numbers

1. Declare a variable with the categorical variable names

In [None]:
categorical_variables = ['BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual']

---

2. Create a dataframe from a subset of the Train set using the variable above

In [None]:
df_categorical = train_set[categorical_variables].copy()
df_categorical.head()

---

3. Apply the transformation to the variables and assess the distribution in order to select a suitable method for each variable

In [None]:
df_categorical_engineered = FeatureEngineeringAnalysis(df=df_categorical, analysis_type='ordinal_encoder')

#### Analysis of plots
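* The ordinal encoder successfully converts each categorical variable to integer codes, and the plots show that the frequency of each category is preserved after encoding.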
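
For intuition, arbitrary ordinal encoding can be sketched in plain pandas: learn a category-to-integer mapping on the train set, then reuse it on the test set. This is a minimal sketch on made-up values, assuming integers are assigned in order of first appearance; `fit_arbitrary_mapping`, `toy_train` and `toy_test` are hypothetical names, and this is not feature-engine's implementation.

```python
import pandas as pd

# Toy data (illustrative values only, not taken from the project dataset)
toy_train = pd.DataFrame({"KitchenQual": ["Gd", "TA", "Gd", "Ex", "TA"]})
toy_test = pd.DataFrame({"KitchenQual": ["TA", "Ex", "Gd"]})


def fit_arbitrary_mapping(series):
    """Map each category to an integer in order of first appearance."""
    return {category: code for code, category in enumerate(series.unique())}


# Learn the mapping on the train set only, then reuse it on the test set
mapping = fit_arbitrary_mapping(toy_train["KitchenQual"])
train_encoded = toy_train["KitchenQual"].map(mapping)
test_encoded = toy_test["KitchenQual"].map(mapping)

print(mapping)
```

Because the integers are arbitrary, their order carries no quality ranking; they only make the variable numeric.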

---

4. Apply the transformation to the Train and Test set

In [None]:
encoder = OrdinalEncoder(encoding_method='arbitrary', variables = categorical_variables)
train_set = encoder.fit_transform(train_set)
test_set = encoder.transform(test_set)

print("* Categorical encoding - ordinal transformation done!")

---

### Numerical Transformation

1. Declare a variable with the numerical variable names

In [None]:
numerical_variables = ['1stFlrSF', '2ndFlrSF', 'BsmtFinSF1', 'BsmtUnfSF', 'GarageArea', 'GarageYrBlt', 'GrLivArea', 'LotArea', 'LotFrontage', 'MasVnrArea', 'OpenPorchSF', 'TotalBsmtSF', 'YearBuilt', 'YearRemodAdd']

2. Create a dataframe from a subset of the Train set using the variable above

In [None]:
df_numerical = train_set[numerical_variables].copy()
df_numerical.head()

3. Apply the transformation to the variables and assess the distribution in order to select a suitable method for each variable

In [None]:
df_numerical_engineered = FeatureEngineeringAnalysis(df=df_numerical, analysis_type='numerical')

#### Analysis of plots

* Variables Analyzed: `1stFlrSF`, `LotArea`
* Applied transformation:
  * `Log e`, `Log 10`, `Reciprocal`, `Power`, `Box Cox`, `Yeo Johnson` 
* With the exception of `Reciprocal` and `Power`, the applied transformations improve the distribution shape and QQ plot. The transformed variables show characteristics of a normal distribution.
* Conclusion:
  * `Log e`, `Log 10`, `Box Cox` and `Yeo Johnson` may be considered for numerical transformation of `1stFlrSF` and `LotArea`.

* Variables Analyzed: `2ndFlrSF`, `BsmtFinSF1`, `BsmtUnfSF`, `GarageYrBlt`, `TotalBsmtSF`
* Applied transformation:
  * `Power`, `Yeo Johnson`
* Only 2 transformations were applied, `Power` and `Yeo Johnson`.
* None of the plots shows an improvement in distribution shape or QQ plot; the transformed variables do not show characteristics of a normal distribution.
* Conclusion:
  * These variables will not be considered for numerical transformation.
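
A likely reason why only `Power` and `Yeo Johnson` survive for these variables is that they contain zeros (for example, a house with no second floor), which the logarithm, reciprocal and Box-Cox transformations cannot handle. The sketch below illustrates this with NumPy and SciPy on made-up values. It assumes zeros are the cause, which this notebook does not confirm, and it uses the square root to stand in for the power transformation (assuming an exponent of 0.5).

```python
import numpy as np
from scipy import stats

# Made-up values containing zeros (e.g. houses without a second floor)
values = np.array([0.0, 0.0, 854.0, 866.0, 1053.0, 0.0, 566.0])

# Box-Cox requires strictly positive data, so SciPy raises a ValueError here
try:
    stats.boxcox(values)
    box_cox_worked = True
except ValueError:
    box_cox_worked = False

# log(0) is -inf and 1/0 is inf, so these results are unusable
with np.errstate(divide="ignore"):
    log_has_inf = np.isinf(np.log(values)).any()
    reciprocal_has_inf = np.isinf(1.0 / values).any()

# The square root (power 0.5) and Yeo-Johnson are both defined at zero
power_values = np.sqrt(values)
yeo_johnson_values, fitted_lambda = stats.yeojohnson(values)

print(box_cox_worked, log_has_inf, reciprocal_has_inf)
```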

* Variables Analyzed: `GarageArea`, `MasVnrArea`
* Applied transformation:
  * `Power`, `Yeo Johnson`
* Only 2 transformations were applied, `Power` and `Yeo Johnson`.
* The plots for `Power` show an improvement in distribution shape and QQ plot, with characteristics of a normal distribution.
* However, `Yeo Johnson` does not show any improvement.
* Conclusion:
  * `Power` may be considered for numerical transformation of `GarageArea` and `MasVnrArea`.

* Variable Analyzed: `GrLivArea`
* Applied transformation:
  * `Log e`, `Log 10`, `Reciprocal`, `Power`, `Box Cox`, `Yeo Johnson`
* With the exception of `Reciprocal`, the applied transformations improve the distribution shape and QQ plot. The transformed variables show characteristics of a normal distribution.
* Conclusion:
  * `Log e`, `Log 10`, `Power`, `Box Cox` and `Yeo Johnson` may be considered for numerical transformation of `GrLivArea`.

* Variable Analyzed: `LotFrontage`
* Applied transformation:
  * `Log e`, `Log 10`, `Reciprocal`, `Power`, `Box Cox`, `Yeo Johnson`
* `Power`, `Box Cox` and `Yeo Johnson` produce distribution shapes and QQ plots similar to those of the untransformed variable.
* `Log e`, `Log 10` and `Reciprocal` do not show improvement.
* Conclusion:
  * This variable will not be considered for numerical transformation.

* Variable Analyzed: `OpenPorchSF`
* Applied transformation:
  * `Power`, `Yeo Johnson`
* Only 2 transformations were applied, `Power` and `Yeo Johnson`.
* The plots for `Yeo Johnson` show an improvement in distribution shape and QQ plot, with characteristics of a normal distribution.
* However, `Power` does not show any improvement.
* Conclusion:
  * `Yeo Johnson` may be considered for numerical transformation of `OpenPorchSF`.

* Variables Analyzed: `YearBuilt`, `YearRemodAdd`
* Applied transformation:
  * `Log e`, `Log 10`, `Reciprocal`, `Power`, `Box Cox`, `Yeo Johnson`
* Transformations on these variables offered no improvement.
* Conclusion:
  * These variables will not be considered for numerical transformation.
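
The visual assessment above can be complemented with a number. The following is a minimal sketch, assuming skewness close to 0 is taken as a rough sign of a more symmetric distribution. It compares skewness before and after a log transformation on synthetic right-skewed data (not the project data); `rng`, `raw` and `transformed` are illustrative names.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data standing in for a variable such as LotArea
rng = np.random.default_rng(seed=0)
raw = pd.Series(rng.lognormal(mean=9.0, sigma=0.5, size=1000), name="raw")

# Natural log transformation (valid because all values are strictly positive)
transformed = np.log(raw)

skew_before = raw.skew()
skew_after = transformed.skew()
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

Skewness only summarises asymmetry, so it supports the histogram and QQ plot rather than replacing them.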

---

### SmartCorrelatedSelection Variables

* All variables will be used for `SmartCorrelatedSelection`
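
To make the idea concrete, the core of correlation-based selection can be sketched with pandas alone: compute the absolute Spearman correlation matrix and list the feature pairs above a threshold. This is only an illustration on synthetic data with hypothetical column names, and it does not reproduce how `SmartCorrelatedSelection` groups features or chooses which one to keep.

```python
import numpy as np
import pandas as pd

# Synthetic data: feature_b is a noisy copy of feature_a, feature_c is independent
rng = np.random.default_rng(seed=0)
feature_a = rng.normal(size=200)
toy = pd.DataFrame({
    "feature_a": feature_a,
    "feature_b": feature_a + rng.normal(scale=0.1, size=200),
    "feature_c": rng.normal(size=200),
})

threshold = 0.6  # the threshold value used in this notebook
corr = toy.corr(method="spearman").abs()

# Collect each pair once (upper triangle only)
correlated_pairs = [
    (corr.index[i], corr.columns[j])
    for i in range(len(corr))
    for j in range(i + 1, len(corr))
    if corr.iloc[i, j] > threshold
]
print(correlated_pairs)
```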

1. Create a copy of the Train set dataframe

In [None]:
df_smart_corr_selection = train_set.copy()
df_smart_corr_selection.head(3)

2. Create engineered variable(s) by applying the transformation(s)
* Looking for groups of features that correlate amongst themselves

In [None]:
from feature_engine.selection import SmartCorrelatedSelection
smart_corr_selection = SmartCorrelatedSelection(variables=None, method="spearman", threshold=0.6, selection_method="cardinality")

smart_corr_selection.fit_transform(df_smart_corr_selection)
smart_corr_selection.correlated_feature_sets_

3. Remove any surplus correlated features since they’ll add the same information to the model.

In [None]:
smart_corr_selection.features_to_drop_

---

## Conclusions and Next Steps

* Feature Engineering Transformers
  * Ordinal categorical encoding: `['BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual']`
  * Numerical transformation:
    * `Log e`, `Log 10`, `Box Cox` and `Yeo Johnson` may be considered for numerical transformation of `1stFlrSF` and `LotArea`.
    * `Power` may be considered for numerical transformation of `GarageArea` and `MasVnrArea`.
    * `Log e`, `Log 10`, `Power`, `Box Cox` and `Yeo Johnson` may be considered for numerical transformation of `GrLivArea`.
    * `Yeo Johnson` may be considered for numerical transformation of `OpenPorchSF`.
  * Smart Correlation Selection: `['2ndFlrSF', 'GarageArea', 'GarageYrBlt', 'OverallQual', 'TotalBsmtSF']`

---