# Feature Engineering Notebook
### Objectives
* Engineer features for Regression and Cluster models

### Inputs
* outputs/datasets/cleaned/X_TrainCleaned.csv
* outputs/datasets/cleaned/X_testCleaned.csv
### Outputs
* Generate a list of variables to engineer

# Change working directory

We need to change the working directory from its current folder to its parent folder

  * We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'C:\\Users\\issam\\Housing-market-analysis.1\\jupyter_notebooks'

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


In [3]:
current_dir = os.getcwd()
current_dir

'C:\\Users\\issam\\Housing-market-analysis.1'

## Load Cleaned Data
Train Set

In [4]:
import pandas as pd
XtrainSet = pd.read_csv("outputs/datasets/cleaned/X_TrainCleaned.csv")
XtrainSet = XtrainSet.dropna()
XtrainSet.head()

Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,GarageArea,GarageFinish,GarageYrBlt,...,KitchenQual,LotArea,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,YearBuilt,YearRemodAdd
0,2158,0.0,4.0,Av,477,ALQ,725,576,Unf,1950.0,...,Gd,12615,84.0,0.0,29,7,6,1202,1950,2001
1,1614,0.0,3.0,Av,20,GLQ,1594,865,RFn,2005.0,...,Gd,11210,86.0,240.0,59,5,7,1614,2005,2006
6,1412,0.0,3.0,Mn,1005,GLQ,387,576,Unf,1988.0,...,Gd,11457,130.0,0.0,0,5,6,1392,1988,1988
10,1504,0.0,2.0,No,16,GLQ,1330,437,Fin,2005.0,...,Gd,3182,43.0,14.0,20,5,7,1346,2005,2006
12,672,672.0,3.0,No,456,Rec,216,468,Unf,1964.0,...,TA,10690,59.0,0.0,128,7,5,672,1920,1997


Test Set

In [5]:
XTestSet = pd.read_csv("outputs/datasets/cleaned/X_testCleaned.csv")

XTestSet = XTestSet.dropna()
XTestSet.head()

Unnamed: 0,1stFlrSF,2ndFlrSF,BedroomAbvGr,BsmtExposure,BsmtFinSF1,BsmtFinType1,BsmtUnfSF,GarageArea,GarageFinish,GarageYrBlt,...,KitchenQual,LotArea,LotFrontage,MasVnrArea,OpenPorchSF,OverallCond,OverallQual,TotalBsmtSF,YearBuilt,YearRemodAdd
0,1091,898.0,3.0,Mn,932,GLQ,133,586,Fin,2002.0,...,Gd,11367,90.0,210.0,60,5,8,1065,2002,2002
2,1040,0.0,3.0,Mn,532,LwQ,364,484,Unf,1977.0,...,Gd,8800,61.0,0.0,0,7,5,1040,1977,2008
3,1572,1096.0,3.0,Av,1016,GLQ,556,726,Fin,2003.0,...,Ex,13688,110.0,664.0,0,5,9,1572,2003,2004
4,2020,0.0,3.0,No,1436,GLQ,570,900,Fin,2009.0,...,Ex,12220,94.0,305.0,54,5,10,2006,2009,2009
7,520,600.0,2.0,No,0,Unf,600,480,RFn,2005.0,...,Gd,3180,30.0,0.0,166,5,7,600,2005,2005
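
The `Unnamed: 0` column in both previews is most likely the row index that was written to the CSV when the cleaned sets were saved. A minimal sketch, assuming that is the case, of how `index_col=0` restores it as the index instead of loading it as a feature column (the inline CSV is a made-up stand-in for the real files):

```python
import io
import pandas as pd

# Hypothetical stand-in for a cleaned CSV that was saved with its index.
csv_text = ",1stFlrSF,KitchenQual\n0,2158,Gd\n1,1614,Gd\n6,1412,Gd\n"

# Default read: the saved index comes back as a feature column.
df_default = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column is used as the row index instead.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df_default.columns.tolist())
print(df_indexed.columns.tolist())
```

Keeping the index out of the feature columns matters later, because any step that selects all numerical columns would otherwise treat it as a feature.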


## Data Exploration

In [6]:
from ydata_profiling import ProfileReport
pandas_report = ProfileReport(df=XtrainSet, minimal=True)
pandas_report.to_notebook_iframe()


## Feature Engineering

### Custom function

In [7]:
from feature_engine import transformation as vt
from feature_engine.outliers import Winsorizer
from feature_engine.encoding import OrdinalEncoder
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from IPython.display import Image, display
sns.set(style="whitegrid")
import warnings
warnings.filterwarnings('ignore')



def FeatureEngineeringAnalysis(df,analysis_type=None):


  """
  - Used for quick feature engineering on numerical and categorical variables
  to decide which transformation best improves the distribution shape
  - Once transformed, use a reporting tool, like ydata-profiling, to evaluate distributions

  """
  check_missing_values(df)
  allowed_types= ['numerical', 'ordinal_encoder', 'outlier_winsorizer']
  check_user_entry_on_analysis_type(analysis_type, allowed_types)
  list_column_transformers = define_list_column_transformers(analysis_type)
  
  
  # Loop over each variable and engineer the data according to the analysis type
  df_feat_eng = pd.DataFrame([])
  for column in df.columns:
    # create additional columns (column_method) to apply the methods
    df_feat_eng = pd.concat([df_feat_eng, df[column]], axis=1)
    for method in list_column_transformers:
      df_feat_eng[f"{column}_{method}"] = df[column]
      
    # Apply the transformers to their respective columns
    df_feat_eng, list_applied_transformers = apply_transformers(analysis_type, df_feat_eng, column)

    # For each variable, assess how the transformations perform
    transformer_evaluation(column, list_applied_transformers, analysis_type, df_feat_eng)

  return df_feat_eng


def check_user_entry_on_analysis_type(analysis_type, allowed_types):
  # Check the analysis type
  if analysis_type is None:
    raise SystemExit(
      f"You should pass analysis_type parameter as one of the following options: {allowed_types}")
  if analysis_type not in allowed_types:
    raise SystemExit(
      f"analysis_type argument should be one of these options: {allowed_types}")

def check_missing_values(df):
  # Stop early: the transformers used below do not accept missing values
  if df.isna().sum().sum() != 0:
    raise SystemExit(
        "There are missing values in your dataset. Please handle them before starting feature engineering.")



def define_list_column_transformers(analysis_type):
  # Set the column suffixes according to analysis_type
  if analysis_type=='numerical':
    list_column_transformers = ["log_e", "log_10", "reciprocal", "power", "box_cox", "yeo_johnson"]
  
  elif analysis_type=='ordinal_encoder':
    list_column_transformers = ["ordinal_encoder"]

  elif analysis_type=='outlier_winsorizer':
    list_column_transformers = ['iqr']

  return list_column_transformers



def apply_transformers(analysis_type, df_feat_eng, column):


  for col in df_feat_eng.select_dtypes(include='category').columns:
    df_feat_eng[col] = df_feat_eng[col].astype('object')


  if analysis_type=='numerical':
    df_feat_eng, list_applied_transformers = FeatEngineering_Numerical(df_feat_eng,column)
  
  elif analysis_type=='outlier_winsorizer':
    df_feat_eng, list_applied_transformers = FeatEngineering_OutlierWinsorizer(df_feat_eng,column)

  elif analysis_type=='ordinal_encoder':
    df_feat_eng, list_applied_transformers = FeatEngineering_CategoricalEncoder(df_feat_eng,column)

  return df_feat_eng, list_applied_transformers


def transformer_evaluation(column, list_applied_transformers, analysis_type, df_feat_eng):
  # For each variable, assess how the transformations perform
  print(f"* Variable Analyzed: {column}")
  print(f"* Applied transformation: {list_applied_transformers} \n")
  for col in [column] + list_applied_transformers:
    
    if analysis_type!='ordinal_encoder':
      DiagnosticPlots_Numerical(df_feat_eng, col)
    
    else:
      if col == column: 
        DiagnosticPlots_Categories(df_feat_eng, col)
      else:
        DiagnosticPlots_Numerical(df_feat_eng, col)

    print("\n")


def DiagnosticPlots_Categories(df_feat_eng, col):
  plt.figure(figsize=(20, 5))
  sns.countplot(data=df_feat_eng, x=col, palette=['#432371'], order = df_feat_eng[col].value_counts().index)
  plt.xticks(rotation=90) 
  plt.suptitle(f"{col}", fontsize=30, y=1.05)        
  #plt.show();
  plt.savefig('DiagnosticPlots_Categories.png')
  numerical_image = Image('DiagnosticPlots_Categories.png')
  display(numerical_image)  
  print("\n")


def DiagnosticPlots_Numerical(df, variable):
  fig, axes = plt.subplots(1, 3, figsize=(20, 6))
  sns.histplot(data=df, x=variable, kde=True, element="step", ax=axes[0]) 
  stats.probplot(df[variable], dist="norm", plot=axes[1])
  sns.boxplot(x=df[variable], ax=axes[2])
  
  axes[0].set_title('Histogram')
  axes[1].set_title('QQ Plot')
  axes[2].set_title('Boxplot')
  fig.suptitle(f"{variable}", fontsize=30, y=1.05)
  #plt.show();


def FeatEngineering_CategoricalEncoder(df_feat_eng, column):
  list_methods_worked = []
  try:  
    encoder= OrdinalEncoder(encoding_method='arbitrary', variables = [f"{column}_ordinal_encoder"])
    df_feat_eng = encoder.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_ordinal_encoder")
  
  except: 
    df_feat_eng.drop([f"{column}_ordinal_encoder"], axis=1, inplace=True)
    
  return df_feat_eng, list_methods_worked


def FeatEngineering_OutlierWinsorizer(df_feat_eng, column):
  list_methods_worked = []

  ### Winsorizer iqr
  try: 
    disc=Winsorizer(
        capping_method='iqr', tail='both', fold=1.5, variables = [f"{column}_iqr"])
    df_feat_eng = disc.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_iqr")
  except: 
    df_feat_eng.drop([f"{column}_iqr"], axis=1, inplace=True)


  return df_feat_eng, list_methods_worked



def FeatEngineering_Numerical(df_feat_eng,column):

  list_methods_worked = []

  ### LogTransformer base e
  try: 
    lt = vt.LogTransformer(variables = [f"{column}_log_e"])
    df_feat_eng = lt.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_log_e")
  except: 
    df_feat_eng.drop([f"{column}_log_e"], axis=1, inplace=True)

  ### LogTransformer base 10
  try: 
    lt = vt.LogTransformer(variables = [f"{column}_log_10"], base='10')
    df_feat_eng = lt.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_log_10")
  except: 
    df_feat_eng.drop([f"{column}_log_10"], axis=1, inplace=True)

  ### ReciprocalTransformer
  try:
    rt = vt.ReciprocalTransformer(variables = [f"{column}_reciprocal"])
    df_feat_eng =  rt.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_reciprocal")
  except:
    df_feat_eng.drop([f"{column}_reciprocal"], axis=1, inplace=True)

  ### PowerTransformer
  try:
    pt = vt.PowerTransformer(variables = [f"{column}_power"])
    df_feat_eng = pt.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_power")
  except:
    df_feat_eng.drop([f"{column}_power"], axis=1, inplace=True)

  ### BoxCoxTransformer
  try:
    bct = vt.BoxCoxTransformer(variables = [f"{column}_box_cox"])
    df_feat_eng = bct.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_box_cox")
  except:
    df_feat_eng.drop([f"{column}_box_cox"], axis=1, inplace=True)

  ### YeoJohnsonTransformer
  try:
    yjt = vt.YeoJohnsonTransformer(variables = [f"{column}_yeo_johnson"])
    df_feat_eng = yjt.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_yeo_johnson")
  except:
    df_feat_eng.drop([f"{column}_yeo_johnson"], axis=1, inplace=True)


  return df_feat_eng, list_methods_worked
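
Each `try`/`except` block above drops the candidate column when its transformer raises an error. A common cause is a variable that contains zeros: log and Box-Cox need strictly positive input, while Yeo-Johnson accepts any real value. A minimal sketch of that difference, using `scipy.stats` directly rather than `feature_engine` (the sample values are made up):

```python
import numpy as np
from scipy import stats

# Made-up values with zeros, similar in spirit to an area variable
# where many houses lack the feature.
values = np.array([0.0, 0.0, 672.0, 898.0, 1096.0])

# Box-Cox is only defined for strictly positive data, so SciPy raises.
try:
    stats.boxcox(values)
    boxcox_worked = True
except ValueError as err:
    boxcox_worked = False
    print(f"Box-Cox failed: {err}")

# Yeo-Johnson extends Box-Cox to zero and negative values.
transformed, fitted_lambda = stats.yeojohnson(values)
yeo_johnson_worked = True
print(f"Yeo-Johnson lambda: {fitted_lambda:.3f}")
```

This is why the list of applied transformations printed for a variable can be shorter than the list of candidates.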

## Dealing with Feature Engineering
### Categorical Encoding - Ordinal: replaces categories with ordinal numbers

* Step 1: Select variable(s)

In [8]:
features_engineering = [ 'BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual']
features_engineering

['BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual']

* Step 2: Create a separate DataFrame, with your variable(s)

In [9]:
df_engineering = XtrainSet[features_engineering].copy()
df_engineering.head()
df_engineering = df_engineering.dropna()
df_engineering.isna().sum()

BsmtExposure    0
BsmtFinType1    0
GarageFinish    0
KitchenQual     0
dtype: int64

* Step 3: Create engineered variable(s) by applying the transformation(s), assess the distribution of each engineered variable, and select the most suitable method for each variable.

In [10]:
df_engineering = FeatureEngineeringAnalysis(df=df_engineering, analysis_type='ordinal_encoder')

* Variable Analyzed: BsmtExposure
* Applied transformation: ['BsmtExposure_ordinal_encoder'] 



NameError: name 'Image' is not defined

In [None]:
# 1 - create a transformer
encoder = OrdinalEncoder(encoding_method='arbitrary', variables = features_engineering)

# 2 - fit_transform into TrainSet
TrainSet = encoder.fit_transform(XtrainSet)

# 3 - transform the TestSet with the encoder fitted on the TrainSet
TestSet = encoder.transform(XTestSet)

print("* Categorical encoding - ordinal transformation inplace")
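
The pattern named in the comments above (fit on the train set, then transform the test set) can be illustrated without `feature_engine`. This is a minimal pandas-only sketch of arbitrary ordinal encoding on made-up category values: the mapping is learned from the train data only and then reused for the test data, so the same category always gets the same number.

```python
import pandas as pd

# Made-up samples of one categorical feature.
train = pd.Series(["Gd", "TA", "Gd", "Ex"], name="KitchenQual")
test = pd.Series(["Ex", "Gd", "TA"], name="KitchenQual")

# "Fit": number the categories in order of first appearance in train.
mapping = {category: code for code, category in enumerate(train.unique())}

# "Transform": apply the train mapping to both sets.
train_encoded = train.map(mapping)
test_encoded = test.map(mapping)

print(mapping)
print(test_encoded.tolist())
```

In this sketch, learning the mapping again from the test data would number the categories by their order in the test data (`Ex` would become 0), so the two sets would no longer share one encoding.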

### Numerical Transformation
* Step 1: Select variable(s)

In [None]:
features_engineering = ['1stFlrSF', '2ndFlrSF', 'BedroomAbvGr', 'BsmtFinSF1', 
                        'BsmtUnfSF', 'EnclosedPorch', 'GarageArea', 'GarageYrBlt', 'GrLivArea', 
                        'LotArea', 'LotFrontage',  'OpenPorchSF', 'OverallCond', 
                        'OverallQual', 'TotalBsmtSF', 'YearBuilt', 'YearRemodAdd']#'MasVnrArea',
features_engineering

* Step 2: Create a separate DataFrame, with your variable(s)

In [None]:
df_engineering = XtrainSet[features_engineering].copy()
df_engineering.head()

* Step 3: Create engineered variable(s) by applying the transformation(s), assess the distribution of each engineered variable, and select the most suitable method

In [None]:
df_engineering = FeatureEngineeringAnalysis(df=df_engineering, analysis_type='numerical')
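
The diagnostic plots can be complemented with a number. A minimal sketch, assuming skewness is an acceptable proxy for how symmetric a distribution is, that compares a made-up right-skewed variable before and after a natural log transform (the data is a random draw, not the housing data):

```python
import numpy as np
import pandas as pd

# Made-up right-skewed, strictly positive variable (log-normal draw).
rng = np.random.default_rng(seed=0)
raw = pd.Series(rng.lognormal(mean=9.0, sigma=0.5, size=1000), name="LotArea")

# Natural log, the same idea as the "log_e" candidate.
logged = np.log(raw)

# Skewness near 0 indicates a roughly symmetric distribution.
summary = pd.DataFrame(
    {"skew": [raw.skew(), logged.skew()]},
    index=["raw", "log_e"],
)
print(summary.round(2))
```

A table like this, built per variable and per candidate transformation, gives a quick way to shortlist methods before reading the histograms and QQ plots.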

In [None]:
from sklearn.preprocessing import StandardScaler

# 1. Create a transformer
scaler = StandardScaler()

# 2. Fit-Transform into TrainSet
XtrainSet[features_engineering] = scaler.fit_transform(TrainSet[features_engineering])

# 3. Transform the TestSet with the scaler fitted on the TrainSet
XTestSet[features_engineering] = scaler.transform(TestSet[features_engineering])

print("* Numerical transformation - StandardScaler transformation inplace")
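
The comments above distinguish fitting on the train set from transforming the test set. A minimal sketch with made-up numbers of what that split means for `StandardScaler`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature train and test arrays.
X_train = np.array([[1000.0], [1500.0], [2000.0]])
X_test = np.array([[1500.0], [2500.0]])

scaler = StandardScaler()

# fit_transform learns the mean and standard deviation from the train data.
X_train_scaled = scaler.fit_transform(X_train)

# transform reuses the train statistics; nothing is learned from the test data.
X_test_scaled = scaler.transform(X_test)

print(scaler.mean_)
print(X_test_scaled.ravel())
```

Calling `fit_transform` on the test data instead would rescale it with its own mean and standard deviation, so train and test values would no longer be on a shared scale.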

In [None]:
df_engineering = XtrainSet.copy()
df_engineering.head()

* Check the engineered variable(s) for correlated features after integrating the transformation(s)

In [None]:
from feature_engine.selection import SmartCorrelatedSelection
corr_sel = SmartCorrelatedSelection(variables=None, method="spearman", threshold=0.6, selection_method="variance")

corr_sel.fit_transform(df_engineering)
corr_sel.correlated_feature_sets_

In [None]:
corr_sel.features_to_drop_
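
`SmartCorrelatedSelection` groups features whose pairwise correlation exceeds the threshold and keeps one feature per group. The grouping step can be approximated by hand. A minimal pandas sketch on made-up data, which only finds correlated pairs and does not reproduce the library's selection logic:

```python
import numpy as np
import pandas as pd

# Made-up features: "b" is a monotonic function of "a"; "c" is independent noise.
rng = np.random.default_rng(seed=0)
a = rng.normal(size=200)
df = pd.DataFrame({"a": a, "b": a ** 3, "c": rng.normal(size=200)})

# Absolute Spearman (rank) correlation between every pair of columns.
corr = df.corr(method="spearman").abs()

# Collect the pairs above the 0.6 threshold, skipping self-pairs and duplicates.
threshold = 0.6
pairs = [
    (left, right)
    for i, left in enumerate(corr.columns)
    for right in corr.columns[i + 1:]
    if corr.loc[left, right] > threshold
]
print(pairs)
```

Spearman correlation works on ranks, so it flags `a` and `b` as correlated even though their relationship is not linear.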

## Conclusion
Based on which transformations could be applied to each variable, we can draw the following conclusions. The number of transformations applied mainly shows which transformers accept the variable's values (log, reciprocal, and Box-Cox require strictly positive input), so skewness itself should be judged from the diagnostic plots.

* **1stFlrSF**: A variety of transformations were applied, so the values are compatible with most transformers. Log, reciprocal, and power transformations are commonly used to normalize skewed distributions, and the plots show which one works best here.

* **2ndFlrSF**: Only two transformations were applied. This most likely reflects the zero values in this variable, which rule out log, reciprocal, and Box-Cox, rather than a less skewed distribution than **1stFlrSF**.

* **BedroomAbvGr**: Similar to **1stFlrSF**, a variety of transformations were applied.

* **BsmtFinSF1**, **BsmtUnfSF**, **EnclosedPorch**, **GarageArea**, **MasVnrArea**, **OpenPorchSF**, **TotalBsmtSF**: Similar to **2ndFlrSF**, only two transformations were applied to these variables, most likely for the same reason.

* **GarageYrBlt**, **GrLivArea**, **LotArea**, **LotFrontage**, **OverallCond**, **OverallQual**, **YearBuilt**, **YearRemodAdd**: A wide range of transformations were applied to these variables. The plots should be used to decide whether the distribution needs normalizing and which method does it best.