# Feature Engineering Notebook

## Objectives

* Engineer features for the regression model

## Inputs

* outputs/datasets/cleaned/TrainSetCleaned.csv
* outputs/datasets/cleaned/TestSetCleaned.csv

## Outputs

* Generate a list of variables to engineer 


---

## Change working directory

* We use os.getcwd() to access the current directory.

In [None]:
import os
current_dir = os.getcwd()
current_dir

## Access the parent directory

* We want to make the parent of the current directory the new current directory.
    * os.path.dirname() gets the parent directory
    * os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

## Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

---

## Imports

In [None]:
import pandas as pd
# pandas profiling report
from pandas_profiling import ProfileReport
# Used for feature engineering functions
from feature_engine import transformation as vt
from feature_engine.outliers import Winsorizer
from feature_engine.encoding import OrdinalEncoder
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid")
import warnings
warnings.filterwarnings('ignore')
# For smart correlation selection
from feature_engine.selection import SmartCorrelatedSelection



---

## Load Cleaned Data

### Train set

In [None]:
train_set_path = "outputs/datasets/cleaned/TrainSetCleaned.csv"
TrainSet = pd.read_csv(train_set_path)
TrainSet.head(5)

### Test Set

In [None]:
test_set_path = 'outputs/datasets/cleaned/TestSetCleaned.csv'
TestSet = pd.read_csv(test_set_path)
TestSet.head(5)

### Check for missing values

In [None]:

missing_data_test_set = TestSet.columns[TestSet.isna().sum() > 0].to_list()
missing_data_train_set = TrainSet.columns[TrainSet.isna().sum() > 0].to_list()
missing_data = missing_data_test_set + missing_data_train_set
missing_data

---

## Data Exploration

* Generate pandas profiling report
    * We can see that many features have skewed distributions, either left or right. Some have outliers, and others have many values equal to zero.
    * We have both numerical and categorical features.

In [None]:
pandas_report = ProfileReport(df=TrainSet, minimal=True)
pandas_report.to_notebook_iframe()
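
The skew and zero-inflation noted above can also be quantified with a couple of numbers per column. The following is a minimal plain-Python sketch, assuming the population form of the Fisher-Pearson skewness coefficient; the toy columns are hypothetical and not taken from the dataset, and pandas' own `.skew()` applies a sample-size correction, so its values will differ slightly.

```python
def skewness(values):
    """Fisher-Pearson moment coefficient of skewness (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # constant column: no spread, so no skew to report
    return m3 / m2 ** 1.5


# Hypothetical toy columns, for illustration only
toy_columns = {
    "symmetric": [1, 2, 3, 4, 5],
    "right_skewed": [1, 1, 1, 2, 2, 3, 10],
    "zero_inflated": [0, 0, 0, 0, 0, 200, 450],
}

for name, col in toy_columns.items():
    zero_share = sum(1 for v in col if v == 0) / len(col)
    print(f"{name}: skew={skewness(col):.2f}, share of zeros={zero_share:.2f}")
```

A positive value indicates a longer right tail, a negative value a longer left tail, and a value near zero a roughly symmetric distribution.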

* We will now explore some of the many potential feature engineering methods we can use to transform the features.

---

## Feature Engineering

### Custom functions

* The following functions were copied and adapted from the feature engineering notebook of the customer churn walkthrough project.

* We use these custom functions for quick feature engineering on numerical and categorical variables, to decide which transformation best improves the shape of each distribution.

* After applying the transformations, we evaluate the resulting distributions with the diagnostic plots the functions produce (histogram, QQ plot and boxplot) to see the effects.

In [None]:
def FeatureEngineeringAnalysis(df,analysis_type=None):
  """
  Used for quick feature engineering on numerical and categorical variables
  """
  check_missing_values(df)
  allowed_types= ['numerical', 'ordinal_encoder',  'outlier_winsorizer']
  check_user_entry_on_analysis_type(analysis_type, allowed_types)
  list_column_transformers = define_list_column_transformers(analysis_type)
  
  # Loop over each variable and engineer the data according to the analysis type
  df_feat_eng = pd.DataFrame([])
  for column in df.columns:
    # create additional columns (column_method) to apply the methods
    df_feat_eng = pd.concat([df_feat_eng, df[column]], axis=1)
    for method in list_column_transformers:
      df_feat_eng[f"{column}_{method}"] = df[column]
      
    # Apply the transformers to their respective columns
    df_feat_eng,list_applied_transformers = apply_transformers(analysis_type, df_feat_eng, column)

    # For each variable, assess how the transformations perform
    transformer_evaluation(column, list_applied_transformers, analysis_type, df_feat_eng)

  return df_feat_eng


def check_user_entry_on_analysis_type(analysis_type, allowed_types):
  '''
  Check analysis type
  '''
  if analysis_type is None:
    raise SystemExit(f"You should pass analysis_type parameter as one of the following options: {allowed_types}")
  if analysis_type not in allowed_types:
      raise SystemExit(f"analysis_type argument should be one of these options: {allowed_types}")

def check_missing_values(df):
  if df.isna().sum().sum() != 0:
    raise SystemExit(
        f"There is missing value in your dataset. Please handle that before getting into feature engineering.")


def define_list_column_transformers(analysis_type):
  '''
  Set suffix columns according to analysis_type
  '''
  if analysis_type=='numerical':
    list_column_transformers = ["log_e","log_10","reciprocal", "power","box_cox","yeo_johnson"]
  
  elif analysis_type=='ordinal_encoder':
    list_column_transformers = ["ordinal_encoder"]

  elif analysis_type=='outlier_winsorizer':
    list_column_transformers = ['iqr']

  return list_column_transformers


def apply_transformers(analysis_type, df_feat_eng, column):
  '''
  Apply transformers
  '''
  for col in df_feat_eng.select_dtypes(include='category').columns:
    df_feat_eng[col] = df_feat_eng[col].astype('object')

  if analysis_type=='numerical':
    df_feat_eng,list_applied_transformers = FeatEngineering_Numerical(df_feat_eng,column)
  
  elif analysis_type=='outlier_winsorizer':
    df_feat_eng,list_applied_transformers = FeatEngineering_OutlierWinsorizer(df_feat_eng,column)

  elif analysis_type=='ordinal_encoder':
    df_feat_eng,list_applied_transformers = FeatEngineering_CategoricalEncoder(df_feat_eng,column)

  return df_feat_eng,list_applied_transformers


def transformer_evaluation(column, list_applied_transformers, analysis_type, df_feat_eng):
  '''
  For each variable, assess how the transformations perform
  '''
  print(f"* Variable Analyzed: {column}")
  print(f"* Applied transformation: {list_applied_transformers} \n")
  for col in [column] + list_applied_transformers:
    
    if analysis_type!='ordinal_encoder':
      DiagnosticPlots_Numerical(df_feat_eng, col)
    
    else:
      if col == column: 
        DiagnosticPlots_Categories(df_feat_eng, col)
      else:
        DiagnosticPlots_Numerical(df_feat_eng, col)

    print("\n")


def DiagnosticPlots_Categories(df_feat_eng, col):
  plt.figure(figsize=(4, 3))
  sns.countplot(data=df_feat_eng, x=col,palette=['#432371'],order = df_feat_eng[col].value_counts().index)
  plt.xticks(rotation=90) 
  plt.suptitle(f"{col}", fontsize=30,y=1.05)        
  plt.show()
  print("\n")


def DiagnosticPlots_Numerical(df, variable):
  fig, axes = plt.subplots(1, 3, figsize=(12, 4))
  sns.histplot(data=df, x=variable, kde=True,element="step",ax=axes[0]) 
  stats.probplot(df[variable], dist="norm", plot=axes[1])
  sns.boxplot(x=df[variable],ax=axes[2])
  
  axes[0].set_title('Histogram')
  axes[1].set_title('QQ Plot')
  axes[2].set_title('Boxplot')
  fig.suptitle(f"{variable}", fontsize=30,y=1.05)
  plt.tight_layout()
  plt.show()


def FeatEngineering_CategoricalEncoder(df_feat_eng,column):
  list_methods_worked = []
  try:  
    encoder= OrdinalEncoder(encoding_method='arbitrary', variables = [f"{column}_ordinal_encoder"])
    df_feat_eng = encoder.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_ordinal_encoder")
  
  except: 
    df_feat_eng.drop([f"{column}_ordinal_encoder"],axis=1,inplace=True)
    
  return df_feat_eng,list_methods_worked


def FeatEngineering_OutlierWinsorizer(df_feat_eng,column):
  list_methods_worked = []
  ### Winsorizer iqr
  try: 
    disc=Winsorizer(
        capping_method='iqr', tail='both', fold=1.5, variables = [f"{column}_iqr"])
    df_feat_eng = disc.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_iqr")
  except: 
    df_feat_eng.drop([f"{column}_iqr"],axis=1,inplace=True)

  return df_feat_eng,list_methods_worked


def FeatEngineering_Numerical(df_feat_eng,column):
  list_methods_worked = []
  ### LogTransformer base e
  try: 
    lt = vt.LogTransformer(variables = [f"{column}_log_e"])
    df_feat_eng = lt.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_log_e")
  except: 
    df_feat_eng.drop([f"{column}_log_e"],axis=1,inplace=True)

  ### LogTransformer base 10
  try: 
    lt = vt.LogTransformer(variables = [f"{column}_log_10"],base='10')
    df_feat_eng = lt.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_log_10")
  except: 
    df_feat_eng.drop([f"{column}_log_10"],axis=1,inplace=True)

  ### ReciprocalTransformer
  try:
    rt = vt.ReciprocalTransformer(variables = [f"{column}_reciprocal"])
    df_feat_eng =  rt.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_reciprocal")
  except:
    df_feat_eng.drop([f"{column}_reciprocal"],axis=1,inplace=True)

  ### PowerTransformer
  try:
    pt = vt.PowerTransformer(variables = [f"{column}_power"])
    df_feat_eng = pt.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_power")
  except:
    df_feat_eng.drop([f"{column}_power"],axis=1,inplace=True)

  ### BoxCoxTransformer
  try:
    bct = vt.BoxCoxTransformer(variables = [f"{column}_box_cox"])
    df_feat_eng = bct.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_box_cox")
  except:
    df_feat_eng.drop([f"{column}_box_cox"],axis=1,inplace=True)

  ### YeoJohnsonTransformer
  try:
    yjt = vt.YeoJohnsonTransformer(variables = [f"{column}_yeo_johnson"])
    df_feat_eng = yjt.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_yeo_johnson")
  except:
    df_feat_eng.drop([f"{column}_yeo_johnson"],axis=1,inplace=True)

  return df_feat_eng,list_methods_worked

* The transformers for this dataset will be:

    * Categorical Encoding
    * Numerical Transformation
    * Outlier Winsorisation
    * Smart Correlation Selection

### Dealing with Feature Engineering

### 1 - Ordinal categorical encoding

* We have selected the following features for ordinal categorical encoding, as they hold categorical values. Ordinal encoding maps each unique label to an integer value:

In [None]:
categorical_variables= ['BsmtExposure',
                        'BsmtFinType1',
                        'GarageFinish',
                        'KitchenQual']

* We create a dataframe consisting of these variables and display it

In [None]:
df_ordinal_engineering = TrainSet[categorical_variables].copy()
df_ordinal_engineering.head()

* We can now run this dataframe through our FeatureEngineeringAnalysis function

In [None]:
df_ordinal_engineering = FeatureEngineeringAnalysis(df=df_ordinal_engineering, analysis_type='ordinal_encoder')

* We can now apply this transformation to both our "Train" and "Test" datasets

In [None]:
encoder = OrdinalEncoder(encoding_method='arbitrary', variables = categorical_variables)
# Fit on the train set only, then reuse the learned mapping on the test set
TrainSet = encoder.fit_transform(TrainSet)
TestSet = encoder.transform(TestSet)

print("* Categorical encoding - ordinal transformation done!")
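
To make the encoding step concrete, here is a minimal plain-Python sketch of ordinal encoding. It assumes that labels are numbered in order of first appearance, which is only one possible reading of an "arbitrary" ordering; the helper names and label values are hypothetical. It also illustrates why a mapping is normally learned once on the train set and reused on the test set: learning it again on different data can assign different integers to the same label.

```python
def fit_ordinal_mapping(labels):
    """Learn label -> integer codes, numbered in order of first appearance."""
    mapping = {}
    for label in labels:
        if label not in mapping:
            mapping[label] = len(mapping)
    return mapping


def apply_ordinal_mapping(labels, mapping):
    """Encode labels with an already-learned mapping; unseen labels become None."""
    return [mapping.get(label) for label in labels]


# Hypothetical label values, for illustration only
train_labels = ["Gd", "TA", "Gd", "Ex", "TA"]
test_labels = ["TA", "Ex", "Gd"]

mapping = fit_ordinal_mapping(train_labels)
train_encoded = apply_ordinal_mapping(train_labels, mapping)
test_encoded = apply_ordinal_mapping(test_labels, mapping)

# Learning a second mapping from the test labels gives inconsistent codes
refit_mapping = fit_ordinal_mapping(test_labels)

print("train mapping:", mapping)
print("test encoded with train mapping:", test_encoded)
print("mapping refit on test:", refit_mapping)
```

In this toy case the train mapping codes "Gd" as 0, while a mapping refit on the test labels codes "Gd" as 2, so a model trained on the first encoding would misread the second.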

### 2 - Numerical Transformation

* The variables for numerical transformation are the following:

In [None]:
numerical_variables = ['1stFlrSF',
                         'GarageArea',
                         'GarageYrBlt',
                         'GrLivArea',
                         'LotFrontage', 
                         'OverallQual',
                         'TotalBsmtSF',
                         'YearBuilt',
                         'YearRemodAdd']

* We apply a similar step as above and create a separate dataframe

In [None]:
df_numerical_engineering = TrainSet[numerical_variables].copy()

* Apply the transformation and assess the distributions

In [None]:
df_numerical_engineering = FeatureEngineeringAnalysis(df=df_numerical_engineering, analysis_type='numerical')

From the numerical transformations, we note the following:

* Log transformation helps normalize variables with large values by compressing the gap between small and large values. We will apply the log_e transformation to 1stFlrSF, GrLivArea, and YearBuilt.

* YeoJohnson seems to be the best transformer for GarageArea, TotalBsmtSF and YearRemodAdd, although the power transformer gives similar results.

* Many of the other variables have a high frequency of zero values. Although none of the transformations changes their distributions much, the power transformer normalizes the range of values best, so we will apply it to GarageYrBlt, LotFrontage and OverallQual.

In [None]:
lt = vt.LogTransformer(variables = ['1stFlrSF',
                                    'GrLivArea',
                                    'YearBuilt'])

yjt = vt.YeoJohnsonTransformer(variables = ['GarageArea',
                                            'TotalBsmtSF',
                                            'YearRemodAdd'])

pt = vt.PowerTransformer(variables = ['GarageYrBlt',
                                      'LotFrontage', 
                                      'OverallQual'])

# Fit each transformer on the train set only, then apply it to the test set
transformers = [lt, yjt, pt]
for t in transformers:
    TrainSet = t.fit_transform(TrainSet)
    TestSet = t.transform(TestSet)

print("* Numerical transformation done!")
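
As a rough illustration of how these transformations reshape a right-skewed variable, the sketch below applies the log, reciprocal and power functions to a hypothetical list of strictly positive values and compares the skewness before and after. The exponent 0.5 for the power transformation is an assumption about the transformer's default, and Box-Cox and Yeo-Johnson are left out because they estimate a parameter from the data. Log and reciprocal are undefined at zero, which is why the helper functions above wrap each transformer in a try/except block.

```python
import math


def skewness(values):
    """Fisher-Pearson moment coefficient of skewness (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed, strictly positive values (not from the dataset)
values = [600, 700, 800, 900, 1000, 1200, 1500, 2200, 4000]

transforms = {
    "original": lambda v: v,
    "log_e": math.log,
    "log_10": math.log10,
    "reciprocal": lambda v: 1 / v,
    "power_0.5": lambda v: v ** 0.5,  # assumed exponent
}

# Skewness of the variable after each transformation
skews = {
    name: skewness([fn(v) for v in values])
    for name, fn in transforms.items()
}

for name, skew in skews.items():
    print(f"{name}: skew={skew:.2f}")
```

A transformation is a reasonable candidate when it brings the skewness closer to zero, which should be confirmed visually with the histogram and QQ plot.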

### 3 - Winsoriser

* We use a winsoriser instead of a trimmer in order to keep the observations in the data while reducing their effect on the prediction.
* We select the variables with potential outliers:

In [None]:
outlier_variables = ['1stFlrSF',
                     'GarageArea',
                     'GarageYrBlt',
                     'GrLivArea',
                     'LotFrontage',
                     'OverallQual',
                     'TotalBsmtSF',
                     'YearRemodAdd']

* We apply a similar step as above and create a separate dataframe

In [None]:
df_winsoriser_engineering = TrainSet[outlier_variables].copy()

* Apply the transformation and assess the distributions

In [None]:
df_winsoriser_engineering = FeatureEngineeringAnalysis(df=df_winsoriser_engineering, analysis_type='outlier_winsorizer')

* We apply the winsoriser to the train and test datasets.

In [None]:
winsoriser = Winsorizer(capping_method='iqr', tail='both', fold=1.5, variables = outlier_variables)
# Learn the capping limits on the train set only, then apply them to the test set
TrainSet = winsoriser.fit_transform(TrainSet)
TestSet = winsoriser.transform(TestSet)

print("* Outlier winsoriser transformation done!")
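
The idea behind IQR capping with a fold of 1.5 on both tails can be sketched in plain Python as follows. The helper names and the sample values are hypothetical, and the quantile interpolation used here (the "inclusive" method of `statistics.quantiles`) is an assumption that may not match the library's computation exactly.

```python
import statistics


def fit_iqr_bounds(values, fold=1.5):
    """Return (lower, upper) capping limits: Q1 - fold*IQR and Q3 + fold*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - fold * iqr, q3 + fold * iqr


def cap(values, lower, upper):
    """Clip every value into [lower, upper]; no observation is removed."""
    return [min(max(v, lower), upper) for v in values]


# Hypothetical values, for illustration only
train_values = [10, 12, 13, 14, 15, 16, 18, 95]
test_values = [9, 14, 17, 120]

# Limits are learned from the train values and reused for the test values
lower, upper = fit_iqr_bounds(train_values)
train_capped = cap(train_values, lower, upper)
test_capped = cap(test_values, lower, upper)

print(f"limits: [{lower}, {upper}]")
print("train capped:", train_capped)
print("test capped:", test_capped)
```

Unlike trimming, the extreme observations (95 and 120 here) stay in the data, but they are pulled back to the upper limit.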

### 4 - SmartCorrelatedSelection Variables

In [None]:
corr_sel = SmartCorrelatedSelection(variables=None, method="spearman", threshold=0.8, selection_method="variance")

corr_sel.fit_transform(df_winsoriser_engineering)
corr_sel.correlated_feature_sets_

In [None]:
corr_sel.features_to_drop_
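
For intuition, the selection logic can be approximated in plain Python: compute the Spearman correlation for each pair of features and, when its absolute value exceeds the threshold, drop the feature with the lower variance. This is a simplified pairwise sketch under those assumptions, not the library's implementation (which groups correlated features into sets); the function names and toy data are hypothetical.

```python
import statistics
from itertools import combinations


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def features_to_drop(columns, threshold=0.8):
    """For each pair above the threshold, drop the lower-variance feature."""
    drop = set()
    for a, b in combinations(columns, 2):
        if a in drop or b in drop:
            continue
        if abs(spearman(columns[a], columns[b])) > threshold:
            var_a = statistics.variance(columns[a])
            var_b = statistics.variance(columns[b])
            drop.add(a if var_a < var_b else b)
    return sorted(drop)


# Hypothetical toy features, for illustration only
columns = {
    "area": [50, 60, 80, 100, 120, 150],
    "area_scaled": [5.1, 6.2, 7.9, 10.3, 12.1, 14.8],
    "quality": [3, 7, 2, 8, 5, 4],
}
print(features_to_drop(columns))
```

Here "area" and "area_scaled" are perfectly rank-correlated, so the one with the smaller variance is flagged, while "quality" is kept because its correlation with "area" is well below the threshold.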

---

## Conclusions and Next Steps

* Feature Engineering Transformers:

    * Ordinal categorical encoding: 'BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual'

    * Numerical transformation: '1stFlrSF', 'GarageArea', 'GarageYrBlt', 'GrLivArea', 'LotFrontage', 'OverallQual', 'TotalBsmtSF', 'YearBuilt', 'YearRemodAdd'

    * Outlier winsoriser: '1stFlrSF', 'GarageArea', 'GarageYrBlt', 'GrLivArea', 'LotFrontage', 'OverallQual', 'TotalBsmtSF', 'YearRemodAdd'