# Feature Engineering Notebook

## Objectives
- Engineer features for Classification, Regression and Cluster models


## Inputs
The inputs for this stage will include cleaned datasets that have been previously prepared:
- **"outputs/datasets/cleaned/train_set.csv"**: The training set used for training the models.
- **"outputs/datasets/cleaned/test_set.csv"**: The test set used for model validation and performance evaluation.

## Outputs
- Generate a list of the variables that will be engineered, describing the nature of the transformations.

---

# Change working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")
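As a small illustration of what `os.path.dirname()` does here, the sketch below uses a hypothetical folder layout (`workspace/project/jupyter_notebooks` is an assumption, not the real project path). It also shows why the cell above should be run only once: each extra call climbs one more level.

```python
import os

# Hypothetical notebook location, used only for illustration
notebook_dir = os.path.join("workspace", "project", "jupyter_notebooks")

# One call to dirname() returns the parent folder (the project root in this layout)
project_root = os.path.dirname(notebook_dir)

# A second call climbs one level further, which is what happens
# if the os.chdir(...) cell is accidentally executed twice
one_level_too_far = os.path.dirname(project_root)

print(project_root)
print(one_level_too_far)
```

If the relative paths used later (such as `outputs/datasets/cleaned/train_set.csv`) fail to resolve, an extra or missing run of the `os.chdir` cell is a likely cause.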

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

---

# Load Cleaned Data

In this section, we load the preprocessed training and testing datasets to prepare for feature engineering and model development.

We'll confirm that the data is properly formatted and ready for further analysis.

## Train Set

In [None]:
import pandas as pd

# Define the path to the cleaned training dataset
train_set_path = "outputs/datasets/cleaned/train_set.csv"

# Load the training dataset from the specified path into a DataFrame
train_set = pd.read_csv(train_set_path)

# Display the first few rows of the training dataset to verify its structure and content
train_set.head()

## Test Set

In [None]:
# Define the path to the cleaned testing dataset
test_set_path = "outputs/datasets/cleaned/test_set.csv"

# Load the testing dataset from the specified path into a DataFrame
test_set = pd.read_csv(test_set_path)

# Display the first few rows of the testing dataset to verify its structure and content
test_set.head()
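Beyond inspecting `head()`, a quick programmatic check can confirm that both splits share the same structure. The sketch below is one possible way to do that; `check_split_consistency` is a hypothetical helper and the two small DataFrames are toy stand-ins for `train_set` and `test_set`.

```python
import pandas as pd

def check_split_consistency(train: pd.DataFrame, test: pd.DataFrame) -> dict:
    """Compare column names, dtypes and missing-value counts of two splits."""
    return {
        "same_columns": list(train.columns) == list(test.columns),
        "same_dtypes": list(train.dtypes) == list(test.dtypes),
        "train_missing": int(train.isna().sum().sum()),
        "test_missing": int(test.isna().sum().sum()),
    }

# Toy stand-ins for the cleaned train and test sets (values are illustrative)
toy_train = pd.DataFrame({"LotArea": [8450, 9600], "KitchenQual": ["Gd", "TA"]})
toy_test = pd.DataFrame({"LotArea": [11250], "KitchenQual": ["Ex"]})

report = check_split_consistency(toy_train, toy_test)
print(report)
```

With the real data, the same call would be `check_split_consistency(train_set, test_set)`.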

---

# Data Exploration

We will now use ydata-profiling (formerly Pandas Profiling) to analyze the variables in our training set and evaluate potential transformations for feature engineering.

In [None]:
from ydata_profiling import ProfileReport
pandas_report = ProfileReport(df=train_set, minimal=True)
pandas_report.to_notebook_iframe()
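When the full report is slow to render, a lightweight per-column summary can answer the first questions (data type, missing values, cardinality, share of zeros). The following is a minimal sketch using pandas only; `summarize_columns` is a hypothetical helper and the example data is made up.

```python
import pandas as pd

def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column summary: dtype, missing count, distinct count, share of zeros."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),
        "share_zero": (df == 0).mean(),
    })

# Made-up example data standing in for a few train_set columns
example = pd.DataFrame({
    "area": [0, 0, 120, 300],
    "quality": ["Gd", "TA", "Gd", "Ex"],
})

summary = summarize_columns(example)
print(summary)
```

A high share of zeros is worth noting early, because it limits which transformations (for example a logarithm) can be applied later.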

---

# Feature Engineering

## Custom Function

First, we'll utilize a function from the Feature-engine lesson from Code Institute to assist us with the feature engineering process.

In [None]:
from feature_engine import transformation as vt
from feature_engine.outliers import Winsorizer
from feature_engine.encoding import OrdinalEncoder
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid")
import warnings
warnings.filterwarnings('ignore')



def FeatureEngineeringAnalysis(df,analysis_type=None):


  """
  - Used for quick feature engineering on numerical and categorical variables,
  to decide which transformation best improves the distribution shape
  - Once transformed, use a reporting tool, like ydata-profiling, to evaluate the distributions
  """
  check_missing_values(df)
  allowed_types= ['numerical', 'ordinal_encoder',  'outlier_winsorizer']
  check_user_entry_on_analysis_type(analysis_type, allowed_types)
  list_column_transformers = define_list_column_transformers(analysis_type)
  
  
  # Loop over each variable and engineer the data according to the analysis type
  df_feat_eng = pd.DataFrame([])
  for column in df.columns:
    # create additional columns (column_method) to apply the methods
    df_feat_eng = pd.concat([df_feat_eng, df[column]], axis=1)
    for method in list_column_transformers:
      df_feat_eng[f"{column}_{method}"] = df[column]
      
    # Apply the transformers to their respective columns
    df_feat_eng, list_applied_transformers = apply_transformers(analysis_type, df_feat_eng, column)

    # For each variable, assess how the transformations perform
    transformer_evaluation(column, list_applied_transformers, analysis_type, df_feat_eng)

  return df_feat_eng


def check_user_entry_on_analysis_type(analysis_type, allowed_types):
  ### Check analysis type
  if analysis_type is None:
    raise SystemExit(f"You should pass analysis_type parameter as one of the following options: {allowed_types}")
  if analysis_type not in allowed_types:
    raise SystemExit(f"analysis_type argument should be one of these options: {allowed_types}")

def check_missing_values(df):
  if df.isna().sum().sum() != 0:
    raise SystemExit(
        "There are missing values in your dataset. Please handle them before getting into feature engineering.")



def define_list_column_transformers(analysis_type):
  ### Set suffix columns according to analysis_type
  if analysis_type=='numerical':
    list_column_transformers = ["log_e","log_10","reciprocal", "power","box_cox","yeo_johnson"]
  
  elif analysis_type=='ordinal_encoder':
    list_column_transformers = ["ordinal_encoder"]

  elif analysis_type=='outlier_winsorizer':
    list_column_transformers = ['iqr']

  return list_column_transformers



def apply_transformers(analysis_type, df_feat_eng, column):


  for col in df_feat_eng.select_dtypes(include='category').columns:
    df_feat_eng[col] = df_feat_eng[col].astype('object')


  if analysis_type=='numerical':
    df_feat_eng,list_applied_transformers = FeatEngineering_Numerical(df_feat_eng,column)
  
  elif analysis_type=='outlier_winsorizer':
    df_feat_eng,list_applied_transformers = FeatEngineering_OutlierWinsorizer(df_feat_eng,column)

  elif analysis_type=='ordinal_encoder':
    df_feat_eng,list_applied_transformers = FeatEngineering_CategoricalEncoder(df_feat_eng,column)

  return df_feat_eng,list_applied_transformers



def transformer_evaluation(column, list_applied_transformers, analysis_type, df_feat_eng):
  # For each variable, assess how the transformations perform
  print(f"* Variable Analyzed: {column}")
  print(f"* Applied transformation: {list_applied_transformers} \n")
  for col in [column] + list_applied_transformers:
    
    if analysis_type!='ordinal_encoder':
      DiagnosticPlots_Numerical(df_feat_eng, col)
    
    else:
      if col == column: 
        DiagnosticPlots_Categories(df_feat_eng, col)
      else:
        DiagnosticPlots_Numerical(df_feat_eng, col)

    print("\n")



def DiagnosticPlots_Categories(df_feat_eng, col):
  plt.figure(figsize=(10, 2.5))
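  # A single bar color is passed with 'color'; 'palette' without 'hue' is deprecated in recent seaborn versions
  sns.countplot(data=df_feat_eng, x=col, color='#432371', order=df_feat_eng[col].value_counts().index)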
  plt.xticks(rotation=90) 
  plt.suptitle(f"{col}", fontsize=30,y=1.05)        
  plt.show();
  print("\n")



def DiagnosticPlots_Numerical(df, variable):
  fig, axes = plt.subplots(1, 3, figsize=(12, 4))
  sns.histplot(data=df, x=variable, kde=True,element="step",ax=axes[0]) 
  stats.probplot(df[variable], dist="norm", plot=axes[1])
  sns.boxplot(x=df[variable],ax=axes[2])
  
  axes[0].set_title('Histogram')
  axes[1].set_title('QQ Plot')
  axes[2].set_title('Boxplot')
  fig.suptitle(f"{variable}", fontsize=30,y=1.05)
  plt.tight_layout()
  plt.show();


def FeatEngineering_CategoricalEncoder(df_feat_eng,column):
  list_methods_worked = []
  try:  
    encoder= OrdinalEncoder(encoding_method='arbitrary', variables = [f"{column}_ordinal_encoder"])
    df_feat_eng = encoder.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_ordinal_encoder")
  
  except: 
    df_feat_eng.drop([f"{column}_ordinal_encoder"],axis=1,inplace=True)
    
  return df_feat_eng,list_methods_worked


def FeatEngineering_OutlierWinsorizer(df_feat_eng,column):
  list_methods_worked = []

  ### Winsorizer iqr
  try: 
    disc=Winsorizer(
        capping_method='iqr', tail='both', fold=1.5, variables = [f"{column}_iqr"])
    df_feat_eng = disc.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_iqr")
  except: 
    df_feat_eng.drop([f"{column}_iqr"],axis=1,inplace=True)


  return df_feat_eng,list_methods_worked




def FeatEngineering_Numerical(df_feat_eng,column):

  list_methods_worked = []

  ### LogTransformer base e
  try: 
    lt = vt.LogTransformer(variables = [f"{column}_log_e"])
    df_feat_eng = lt.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_log_e")
  except: 
    df_feat_eng.drop([f"{column}_log_e"],axis=1,inplace=True)

  ### LogTransformer base 10
  try: 
    lt = vt.LogTransformer(variables = [f"{column}_log_10"],base='10')
    df_feat_eng = lt.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_log_10")
  except: 
    df_feat_eng.drop([f"{column}_log_10"],axis=1,inplace=True)

  ### ReciprocalTransformer
  try:
    rt = vt.ReciprocalTransformer(variables = [f"{column}_reciprocal"])
    df_feat_eng =  rt.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_reciprocal")
  except:
    df_feat_eng.drop([f"{column}_reciprocal"],axis=1,inplace=True)

  ### PowerTransformer
  try:
    pt = vt.PowerTransformer(variables = [f"{column}_power"])
    df_feat_eng = pt.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_power")
  except:
    df_feat_eng.drop([f"{column}_power"],axis=1,inplace=True)

  ### BoxCoxTransformer
  try:
    bct = vt.BoxCoxTransformer(variables = [f"{column}_box_cox"])
    df_feat_eng = bct.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_box_cox")
  except:
    df_feat_eng.drop([f"{column}_box_cox"],axis=1,inplace=True)


  ### YeoJohnsonTransformer
  try:
    yjt = vt.YeoJohnsonTransformer(variables = [f"{column}_yeo_johnson"])
    df_feat_eng = yjt.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_yeo_johnson")
  except:
    df_feat_eng.drop([f"{column}_yeo_johnson"],axis=1,inplace=True)


  return df_feat_eng,list_methods_worked

## Feature Engineering Summary

During the feature engineering phase, we analyze the variables and fit every transformer on the training set only. The fitted transformers are then applied to the test set, which prepares both sets for the subsequent modeling stages.

We will use the following transformers to enhance the predictive power of our dataset:
- **Categorical Encoding**: Convert text variables into numerical format to facilitate their use in machine learning models.
- **Numerical Transformations**: Apply transformations to numerical variables to improve their distribution, which is expected to enhance model performance.
- **Smart Correlated Selection**: Evaluate all variables for redundancy using correlation analysis and identify any that are unnecessary, streamlining the feature set.

These steps are designed to prepare our data, ensuring it is optimized for the modeling process that will follow.

## Categorical Encoding

We will convert text variables into numerical format in the following steps:

**Step 1**: Identify categorical variables in the training dataset.

In [None]:
# Selecting categorical variables from the train_set DataFrame
# 'select_dtypes' is used to filter columns based on data type
# Including both 'object' and 'category' data types to capture all possible categorical columns
categorical_variables = list(train_set.select_dtypes(['object','category']).columns)

# Display the list of categorical variables
categorical_variables

**Step 2**: Create a new DataFrame containing the previously identified categorical variables.

In [None]:
# Create a new DataFrame by extracting the columns listed in 'categorical_variables' from 'train_set'
# The 'copy()' function is used to ensure that the new DataFrame is a separate object
df_categorical_engineering = train_set[categorical_variables].copy()

# Display the first few rows of the new DataFrame
df_categorical_engineering.head()

**Step 3**: Transform the variables and evaluate their distributions to determine the most appropriate method for each.

In [None]:
# Apply the 'FeatureEngineeringAnalysis' function to the DataFrame
# Specify an 'ordinal_encoder' analysis, converting categorical text data into a numerical format
categorical_transformed = FeatureEngineeringAnalysis(df=df_categorical_engineering, analysis_type='ordinal_encoder')

**Step 4**: Implement the transformation on both the Train and Test sets.

In [None]:
# Set the encoding method to 'arbitrary' which assigns integers to the categories in the order they are observed.
# Specify the variables to be encoded using the list 'categorical_variables'.
encoder = OrdinalEncoder(encoding_method='arbitrary', variables=categorical_variables)

# Fit the encoder on the training set and transform it
train_set = encoder.fit_transform(train_set)

# Apply the fitted encoder to the testing set
test_set = encoder.transform(test_set)

# Print a confirmation message
print("* Categorical encoding - ordinal transformation done!")
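To make the idea behind arbitrary ordinal encoding concrete, the sketch below reproduces it with pandas only. It is an illustration of the concept, not Feature-engine's implementation; `fit_arbitrary_mapping` is a hypothetical helper and the category values are toy data.

```python
import pandas as pd

def fit_arbitrary_mapping(series: pd.Series) -> dict:
    """Map each category to an integer in order of first appearance."""
    return {category: code for code, category in enumerate(series.unique())}

# Toy columns standing in for one categorical variable in each split
train_col = pd.Series(["Gd", "TA", "Gd", "Ex"])
test_col = pd.Series(["TA", "Ex", "Fa"])  # "Fa" never appears in the training data

# The mapping is learned from the training data only...
mapping = fit_arbitrary_mapping(train_col)

# ...and then reused unchanged on both splits
train_encoded = train_col.map(mapping)
test_encoded = test_col.map(mapping)

print(mapping)
print(train_encoded.tolist())
print(test_encoded.tolist())
```

In this sketch a category that is absent from the training data ends up as a missing value in the test set. How Feature-engine's `OrdinalEncoder` treats unseen categories depends on its version and settings, so it is worth checking the test set for missing values after encoding.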

## Numerical Transformations

In this section, we will apply transformations to numerical variables to optimize their distributions. 

**Step 1**: Identify numerical variables in the training dataset.

In [None]:
# Variables to be excluded
exclude_vars = ['BsmtExposure', 'BsmtFinType1', 'GarageFinish', 'KitchenQual', 'SalePrice']

# Extract a list of numerical variables from the train_set DataFrame
# 'select_dtypes' is used to filter columns based on data type, including both integer and float types
all_numerical_variables = list(train_set.select_dtypes(include=['int64', 'float64']).columns)

# Exclude the encoded categorical variables and the target (SalePrice) from the list
numerical_variables = [var for var in all_numerical_variables if var not in exclude_vars]

# Display the list of numerical variables
numerical_variables

**Step 2**: Create a new DataFrame containing the previously identified numerical variables.

In [None]:
# Create a new DataFrame by extracting the columns listed in 'numerical_variables' from 'train_set'
# The 'copy()' function is used to ensure that the new DataFrame is a separate object
df_numerical_engineering = train_set[numerical_variables].copy()

# Display the first few rows of the new DataFrame
df_numerical_engineering.head()

**Step 3**: Transform the variables and evaluate their distributions to determine the most appropriate method for each, then implement the chosen transformations on both the Train and Test sets.

In [None]:
# Apply the 'FeatureEngineeringAnalysis' function to the DataFrame
# The parameter 'analysis_type' is set to 'numerical', indicating that the function should perform numerical transformations.
numerical_transformed = FeatureEngineeringAnalysis(df=df_numerical_engineering, analysis_type='numerical')
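The diagnostic plots are judged visually. As a complement, skewness before and after each candidate transformation gives a rough numeric comparison. The sketch below assumes a strictly positive variable and uses synthetic data; `compare_skewness` is a hypothetical helper, and the square root stands in for a power transformation with exponent 0.5 (which, to my knowledge, is the Feature-engine `PowerTransformer` default).

```python
import numpy as np
import pandas as pd

def compare_skewness(values: pd.Series) -> pd.Series:
    """Skewness of a strictly positive series before and after simple transformations."""
    candidates = {
        "original": values,
        "log_e": np.log(values),
        "reciprocal": 1 / values,
        "power_0.5": np.sqrt(values),
    }
    return pd.Series({name: series.skew() for name, series in candidates.items()})

# Synthetic right-skewed data standing in for an area-like variable
rng = np.random.default_rng(0)
sample = pd.Series(rng.lognormal(mean=7.0, sigma=0.5, size=1000))

skewness = compare_skewness(sample)
print(skewness.round(2))
```

Values closer to zero indicate a more symmetric distribution. Skewness is only one summary, so it should support the histogram and QQ plot reading rather than replace it.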

| Feature          | Transformation Decision                                   | Reasoning                                                                 |
|------------------|-----------------------------------------------------------|---------------------------------------------------------------------------|
| **1stFlrSF**     | <span style="color:yellow">Log</span>                     | Log improves distribution and QQ plot alignment significantly.            |
| **2ndFlrSF**     | None                                                      | Minimal impact from power; no substantial justification for change.       |
| **BedroomAbvGr** | None                                                      | No significant improvement from any transformation.                       |
| **BsmtFinType1** | <span style="color:red">Power</span>                      | Power improves distribution and QQ plot, especially excluding outlier (0).|
| **BsmtUnfSF**    | <span style="color:red">Power</span>                      | Both Power and Yeo-Johnson improve distribution and QQ plots comparably.  |
| **GarageArea**   | <span style="color:green">Yeo-Johnson</span>              | Best improvement in QQ plots, especially excluding outlier (0).           |
| **GrLivArea**    | <span style="color:yellow">Log</span>                     | Significantly enhances distribution and QQ plot.                          |
| **LotArea**      | <span style="color:red">Power</span>                      | Drastically improves QQ plot alignment.                                   |
| **LotFrontage**  | None                                                      | Better original QQ plot despite improved histogram with Power and Yeo.    |
| **MasVnrArea**   | <span style="color:red">Power</span>                      | Significant improvement in both distribution and QQ plot excluding (0).   |
| **OpenPorchSF**  | <span style="color:red">Power</span>                      | Notable improvement in QQ plot over Yeo-Johnson.                          |
| **OverallCond**  | None                                                      | No transformation shows significant improvement.                          |
| **OverallQual**  | None                                                      | No significant improvement from any transformation.                       |
| **TotalBsmtSF**  | <span style="color:red">Power</span>                      | Leads in QQ plot improvement over Yeo-Johnson.                            |
| **YearBuilt**    | None                                                      | No transformation yields significant improvements.                        |
| **YearRemodAdd** | None                                                      | No significant improvement from any transformation.                       |


In [None]:
from sklearn.pipeline import Pipeline

# Define the variables to be transformed using Log, Power, and Yeo-Johnson transformations
variables_log = ["1stFlrSF", "GrLivArea"]
variables_power = ["BsmtFinType1", "BsmtUnfSF", "LotArea", "MasVnrArea", "OpenPorchSF", "TotalBsmtSF"]
variables_yeo = ["GarageArea"]

# Create a pipeline comprising the specified transformations for specified groups of variables
pipeline = Pipeline([
      ("log", vt.LogTransformer(variables = variables_log, base= 'e')),
      ("pwr", vt.PowerTransformer(variables = variables_power)),
      ("yeo", vt.YeoJohnsonTransformer(variables = variables_yeo))
    ])

# Fit the pipeline to the training set and transform the training data
train_set = pipeline.fit_transform(train_set)

# Apply the same transformations to the test set
test_set = pipeline.transform(test_set)

# Print confirmation that the numerical transformations have been completed
print("* The numerical transformation has been completed!")
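A logarithm is undefined for zero or negative values, so a transformer fitted on the training set can still fail on a test set that contains such values. A minimal pre-check, assuming the column names from `variables_log`, could look like the sketch below; `find_non_positive` is a hypothetical helper and the DataFrame holds toy values.

```python
import pandas as pd

def find_non_positive(df: pd.DataFrame, columns: list) -> dict:
    """Return, per column, how many values are <= 0 (invalid for a logarithm)."""
    counts = (df[columns] <= 0).sum()
    return {col: int(n) for col, n in counts.items() if n > 0}

# Toy stand-in for the test set (values are illustrative, not taken from the data)
toy_test = pd.DataFrame({
    "1stFlrSF": [856, 1262, 920],
    "GrLivArea": [1710, 0, 1786],
})

problems = find_non_positive(toy_test, ["1stFlrSF", "GrLivArea"])
print(problems)
```

An empty dictionary means the log transformation can be applied safely to those columns. With the real data, the check would be run on both `train_set` and `test_set` before fitting the pipeline.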

We will verify that the transformations have been accurately applied to both the training and test datasets.

In [None]:
train_set.head()

In [None]:
test_set.head()

## Smart Correlated Selection

In this section, we will use smart correlated selection across all features to eliminate redundant ones.

**Step 1**: Identify the variables for smart correlated selection, focusing exclusively on the features.

- In this analysis, we exclude the target variable because our objective is to examine correlations among features only.

- The primary goal is to reduce redundancy within the predictors in our model. Since we aim to model the relationship between features and the target, removing the target due to its correlation with features would counteract the purpose of predictive modeling.

In summary, we keep **SalePrice** out of the feature selection process that aims to eliminate correlated predictors.

In [None]:
df_correlation_variables = train_set.drop('SalePrice', axis=1)

**Step 2**: Create a new DataFrame containing the previously identified variables.

In [None]:
# Copy 'df_correlation_variables' (train_set without 'SalePrice') into a new DataFrame
# The 'copy()' function is used to ensure that the new DataFrame is a separate object
df_correlation_engineering = df_correlation_variables.copy()

# Display the first few rows of the new DataFrame
df_correlation_engineering.head()

**Step 3**: Analyze the correlation among features and identify highly correlated pairs to determine which variables to remove for reducing redundancy.

In [None]:
from feature_engine.selection import SmartCorrelatedSelection

# Initialize the SmartCorrelatedSelection transformer.
# variables=None considers all numerical variables in the DataFrame.
# method="spearman" specifies using the Spearman rank correlation.
# threshold=0.8 sets the correlation coefficient above which features are considered highly correlated.
# selection_method="variance" keeps the feature with the highest variance in each correlated group and drops the others.
corr_sel = SmartCorrelatedSelection(variables=None, method="spearman", threshold=0.8, selection_method="variance")

corr_sel.fit_transform(df_correlation_engineering)
corr_sel.correlated_feature_sets_

**Step 4**: Identify which of the correlated features are surplus, since they contribute redundant information to the model.

In [None]:
# Retrieve and display the list of features that were identified as redundant
corr_sel.features_to_drop_
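The logic behind this selection can be sketched with pandas alone: compute the Spearman correlation matrix, list the pairs above the threshold, and keep the higher-variance feature of each pair. This is a simplified illustration on synthetic data, not Feature-engine's implementation; `correlated_pairs` and `lower_variance_features` are hypothetical helpers.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df: pd.DataFrame, threshold: float = 0.8) -> list:
    """List feature pairs whose absolute Spearman correlation exceeds the threshold."""
    corr = df.corr(method="spearman").abs()
    cols = list(corr.columns)
    pairs = []
    for i, left in enumerate(cols):
        for right in cols[i + 1:]:
            if corr.loc[left, right] > threshold:
                pairs.append((left, right, round(float(corr.loc[left, right]), 3)))
    return pairs

def lower_variance_features(df: pd.DataFrame, pairs: list) -> list:
    """From each correlated pair, mark the lower-variance feature as redundant."""
    variances = df.var()
    redundant = set()
    for left, right, _ in pairs:
        redundant.add(left if variances[left] < variances[right] else right)
    return sorted(redundant)

# Synthetic features: 'b' is almost a rescaled copy of 'a', 'c' is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + 0.1 * rng.normal(size=200),
    "c": rng.normal(size=200),
})

pairs = correlated_pairs(toy)
redundant = lower_variance_features(toy, pairs)
print(pairs)
print(redundant)
```

Note that variance depends on the scale of each feature, so which member of a pair is kept can change after transformations such as the log or power steps applied earlier.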

---

# Conclusions and Next Steps

## Conclusions

Feature Engineering Transformations:
- **Ordinal Categorical Encoding**: Applied to variables such as **BsmtExposure**, **BsmtFinType1**, **GarageFinish**, and **KitchenQual**.
- **Log-e Numerical Transformation**: Implemented on **1stFlrSF** and **GrLivArea** to enhance their distributions.
- **Power Numerical Transformation**: Used for **BsmtFinType1**, **BsmtUnfSF**, **LotArea**, **MasVnrArea**, **OpenPorchSF**, and **TotalBsmtSF** to adjust their scales and distributions.
- **Yeo-Johnson Numerical Transformation**: Applied to **GarageArea** for normalization.
- **Smart Correlated Selection**: **1stFlrSF** was identified as redundant and marked for removal.

## Next Steps
- **Modeling**: Proceed with selecting and applying a linear regression algorithm to the refined dataset.
- **Hyperparameter Tuning**: Optimize the model’s settings to achieve the best performance possible.