<a href="https://colab.research.google.com/github/FernandoRocha88/WalkthroughProject/blob/main/jupyter_notebooks/03-%20FeatureEngineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Feature Engineering Notebook

## Objectives

*   Engineer features for the classification, regression and clustering models


## Inputs

* inputs/datasets/cleaned/TrainSetCleaned.csv
* inputs/datasets/cleaned/TestSetCleaned.csv

## Outputs

* Train and Test sets with engineered features, both saved under inputs/datasets/feat_eng

## Additional Comments | Insights | Conclusions



  * xxxx

* Feature Engineering
  * xxx


---

# Install Packages

In [None]:
! pip install matplotlib -U
! pip install pandas-profiling==2.11.0
! pip install feature-engine==1.0.2
! pip install ppscore==1.2.0

In [None]:
# Restart the runtime: this restarts the Colab session, and all your variables will be lost
import os
os.kill(os.getpid(), 9)


# Setup GPU

* Go to Edit → Notebook Settings
* In the Hardware accelerator menu, select GPU
* Note: when you select an option (GPU, TPU or None), you switch among kernels/sessions

---
* How do I know if I am using the GPU?
  * Run the code below. If it returns a device name such as '/device:GPU:0', you are using the GPU in this session. An empty string means no GPU was found.


In [None]:
import tensorflow as tf
tf.test.gpu_device_name()

# **Connection between: Colab Session and your GitHub Repo**

### Insert your **credentials**

* The variable's content will exist only while the session exists. Once this session terminates, the variable's content will be erased permanently.

In [None]:
from getpass import getpass
import os
from IPython.display import clear_output 
print("=== Insert your credentials === \nType in and hit Enter")
UserName = getpass('GitHub User Name: ')
UserEmail = getpass('GitHub User E-mail: ')
RepoName = getpass('GitHub Repository Name: ')
UserPwd = getpass('GitHub Password or Personal Access Token: ')
clear_output()
print("* Thanks for inserting your credentials!")
print(f"* You may now Clone your Repo to this Session, "
      f"then Connect this Session to your Repo.")

---

### **Clone** your GitHub Repo to your current Colab session

* So you can have access to your project's files

In [None]:
! git clone https://github.com/{UserName}/{RepoName}.git
! rm -rf sample_data   # remove the content/sample_data folder, since we don't need it for this project

print("\n")
%cd /content/{RepoName}
print(f"\n\n* Current session directory is:  {os.getcwd()}")
print(f"* You may refresh the session folder to access {RepoName} folder.")

---

### **Connect** this Colab session to your GitHub Repo

* So that, if needed, you can push files generated in this session to your Repo.

In [None]:
!git config --global user.email {UserEmail}
!git config --global user.name {UserName}
!git remote rm origin
!git remote add origin https://{UserName}:{UserPwd}@github.com/{UserName}/{RepoName}.git

print(f"\n\n * The current Colab Session is connected to the following GitHub repo: {UserName}/{RepoName}")
print(" * You can now push new files to the repo.")

---

### **Push** generated/new files from this Session to GitHub repo

* Git commit

In [None]:
CommitMsg = "added-cleaned-data"
!git add .
!git commit -m {CommitMsg}

* Git Push

In [None]:
!git push origin main


---

### **Delete** Cloned Repo from current Session

In [None]:
%cd /content
!rm -rf {RepoName}
print(f"\n * Please refresh session folder to validate that {RepoName} folder was removed from this session.")

---

# Load your data

In [None]:
import pandas as pd
train_set_path = "/content/WalkthroughProject/inputs/datasets/cleaned/TrainSetCleaned.csv"
TrainSet = pd.read_csv(train_set_path)
TrainSet.info()

In [None]:
test_set_path = '/content/WalkthroughProject/inputs/datasets/cleaned/TestSetCleaned.csv'
TestSet = pd.read_csv(test_set_path)
TestSet.info()

## Quick exploration with Pandas Profiling

In [None]:
TrainSet.columns

In [None]:
from pandas_profiling import ProfileReport
flag_minimal=True # True  False
ProfileReport(df=TrainSet, minimal=flag_minimal).to_notebook_iframe()

# Correlation and PPS

## Correlation

* Which variables are most correlated with a given set of variables?

In [None]:
df_corr_spearman = TrainSet.corr(method="spearman")
df_corr_pearson = TrainSet.corr(method="pearson")

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

def heatmap_correlation(df_corr,CorrThreshold):
  NumberOfColumns = len(df_corr.columns)

  if NumberOfColumns > 1:
      # hide the upper triangle (duplicated pairs) and correlations weaker than the threshold
      mask = np.zeros_like(df_corr, dtype=bool)
      mask[np.triu_indices_from(mask)] = True
      mask[abs(df_corr) < CorrThreshold] = True

      fig, ax = plt.subplots(figsize=(20,10))
      ax = sns.heatmap(data=df_corr,annot=True,
                       xticklabels=True,yticklabels=True,
                       mask=mask,cmap='viridis',annot_kws={"size": 8})
      plt.ylim(NumberOfColumns,0)
      plt.show()


In [None]:
print("Correlation Heatmap - Spearman: evaluates the monotonic relationship \n")
heatmap_correlation(df_corr=df_corr_spearman, CorrThreshold=0.6)

In [None]:
print("Correlation Heatmap - Pearson: evaluates the linear relationship between two continuous variables \n")
heatmap_correlation(df_corr=df_corr_pearson,CorrThreshold=0.6)
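
The difference between the two coefficients can be seen on a small synthetic example. This is only a sketch: the data below is made up and the names `demo`, `spearman` and `pearson` are illustrative, not part of the project. When `y` grows monotonically but not linearly with `x`, Spearman reaches 1 while Pearson stays clearly below it.

```python
import numpy as np
import pandas as pd

# Synthetic data (not from the project): y grows monotonically, but not linearly, with x
x = np.arange(1, 21)
demo = pd.DataFrame({"x": x, "y": np.exp(x / 2.0)})

# Spearman works on ranks, Pearson on the raw values
spearman = demo.corr(method="spearman").loc["x", "y"]
pearson = demo.corr(method="pearson").loc["x", "y"]

print(f"Spearman: {spearman:.3f}")
print(f"Pearson:  {pearson:.3f}")
```

A large gap between the two values for a pair of variables is a hint that the relationship is monotonic but non-linear.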

## PPS

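In [None]:
import ppscore as pps
# Compute the raw PPS matrix on a 2% sample of the Train set (adjust frac as needed)
ppsMatrixRaw = pps.matrix(TrainSet.sample(frac=0.02))
ppsMatrixRaw

In [None]:
# Pivot the long-format result into a matrix: predictors (x) as columns, predicted variables (y) as rows
FullMatrix = (ppsMatrixRaw
    [['x', 'y', 'ppscore']]
    .pivot(columns='x', index='y', values='ppscore')
    )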

In [None]:
def heatmap_pps(df,PPS_Threshold):
    import matplotlib.pyplot as plt
    import seaborn as sns
    import numpy as np
    if len(df.columns) > 1:

        mask = np.zeros_like(df, dtype=bool)
        mask[abs(df) < PPS_Threshold] = True

        fig, ax = plt.subplots(figsize=(20,8))
        ax = sns.heatmap(
            df, 
            annot=True,
            xticklabels=True,
            yticklabels=True,
            mask=mask,
            cmap='Blues',
            annot_kws={"size": 7})
        
        plt.ylim(len(df.columns),0)
        plt.show()

heatmap_pps(df=FullMatrix,PPS_Threshold=0.1)

* pps heatmap with target

In [None]:
def heatmap_pps_target(df,NumberOfColumns):
  import matplotlib.pyplot as plt
  import seaborn as sns
  import numpy as np
  fig, ax = plt.subplots(figsize=(20,8))
  ax = sns.heatmap(
          df,
          xticklabels=True,
          yticklabels=True,
          annot=True,
          cmap='coolwarm',
          annot_kws={"size": 8})

  plt.ylim(NumberOfColumns,0)
  plt.show()

heatmap_pps_target(df=FullMatrix,NumberOfColumns=FullMatrix.shape[1])
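
To look at the target only, one option is to filter the long-format scores before pivoting. The sketch below is an assumption about how that could be done, not the notebook's method: it uses a hand-made table with the same `x`, `y` and `ppscore` columns, and the scores are invented for illustration.

```python
import pandas as pd

# Hypothetical long-format scores with the same x / y / ppscore columns as above.
# The numbers are made up for illustration only.
pps_long = pd.DataFrame({
    "x": ["Humidity3pm", "Rainfall", "RainfallTomorrow",
          "Humidity3pm", "Rainfall", "RainfallTomorrow"],
    "y": ["RainfallTomorrow", "RainfallTomorrow", "RainfallTomorrow",
          "Rainfall", "Rainfall", "Rainfall"],
    "ppscore": [0.12, 0.05, 1.0, 0.08, 1.0, 0.03],
})

target = "RainfallTomorrow"

# Keep the rows where the target is the predicted variable and drop its trivial self-score
is_target_row = (pps_long["y"] == target) & (pps_long["x"] != target)
target_scores = pps_long[is_target_row].pivot(columns="x", index="y", values="ppscore")

print(target_scores)
```

The resulting single-row frame could then be passed to a heatmap function in place of the full matrix.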

# Feature Engineering

* At this stage, there is no missing data in your Train and Test sets.
* Now you are looking to engineer (that is, to transform) your variables, so that the Machine Learning model can better learn the relationships among the variables (features and labels).
  * It is important to run a quick EDA to assess the shape of each variable's distribution. Many Machine Learning models, linear models in particular, tend to learn better when distributions are close to normal. To engineer that, you can use transformers from packages like feature-engine or sklearn.
  * You can also use your business acumen and technical expertise to create new variables. For example, imagine your dataset describes the operations of your orange juice company. There is a variable called "revenue" and another called "volume"; you can divide revenue by volume to know how much money you make per liter of manufactured juice.
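
The orange juice example can be sketched in a few lines. The figures and the column name `revenue_per_liter` are made up for illustration.

```python
import pandas as pd

# Made-up figures for the orange juice example; column names are illustrative
juice = pd.DataFrame({
    "revenue": [1200.0, 900.0, 1500.0],  # money earned
    "volume": [400.0, 300.0, 600.0],     # liters of juice manufactured
})

# New variable created from domain knowledge: money made per liter
juice["revenue_per_liter"] = juice["revenue"] / juice["volume"]

print(juice)
```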

---   

* **Strategy**


* **1 - List all the variables you initially want to engineer, grouped into numerical variables first, then categorical variables**

* Numerical
  * MaxTemp
  * Rainfall
  * WindGustSpeed
  * WindSpeed9am
  * WindSpeed3pm
  * Humidity9am
  * Humidity3pm
  * Temp9am
  * Temp3pm
  * Latitude
  * Longitude
  * RainfallTomorrow  (target variable for Reg model)

* Categorical
  * WindGustDir
  * WindDir9am
  * WindDir3pm
  * Cloud9am
  * State




* **2 - Consider the following template to help your engineering process**

  * 1 - Select variable(s) and describe distribution
  * 2 - Select the engineering method(s)
  * 3 - Create a separate dataframe, with your variable(s)
  * 4 - Create engineered variable(s) by applying the method(s)
  * 5 - Assess the distributions of the engineered variables and select the most suitable method
  * 6 - If you are satisfied, apply the selected method to the Train and Test set
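
Steps 1 to 5 of this template can be sketched on synthetic data as follows. This is a minimal illustration under stated assumptions: the variable `var` is generated, not taken from the project, the transforms come from NumPy and SciPy instead of feature-engine, and absolute skewness is used here as one simple way to compare candidates.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)

# Steps 1 and 3: a synthetic, right-skewed, strictly positive variable in its own dataframe
df_demo = pd.DataFrame({"var": rng.lognormal(mean=0.0, sigma=1.0, size=2000)})

# Steps 2 and 4: apply the candidate methods, one new column per method
df_demo["var_log"] = np.log(df_demo["var"])
df_demo["var_sqrt"] = np.sqrt(df_demo["var"])
df_demo["var_boxcox"], fitted_lambda = stats.boxcox(df_demo["var"].to_numpy())

# Step 5: compare the candidates; a lower absolute skewness means a more symmetric shape
skewness = df_demo.skew().abs().sort_values()
print(skewness)
```

Skewness is only one summary of shape, so it is worth confirming the choice visually with a histogram or a Q-Q plot before moving on to step 6.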


## Custom functions for engineering numerical variables

* xxxxx

In [None]:
from feature_engine import transformation as vt
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
sns.set(style="darkgrid")
import warnings
warnings.filterwarnings('ignore')

def FeatEngineering_NumericalVariableTransformers(df):
    """
    - Used for quick feature engineering on numerical variables, to decide
    which method best transforms the distribution shape to look more Gaussian.
    - Once transformed, use a reporting tool, like sweetviz, to evaluate distributions

    - Transformers applied include: LogTransformer, ReciprocalTransformer,
    PowerTransformer, BoxCoxTransformer, YeoJohnsonTransformer

    """

    # column suffix -> transformer class, in the order the methods are applied
    transformers = {
        "lt": vt.LogTransformer,
        "rt": vt.ReciprocalTransformer,
        "pt": vt.PowerTransformer,
        "bct": vt.BoxCoxTransformer,
        "yj": vt.YeoJohnsonTransformer,
    }

    df_feat_eng = pd.DataFrame([])  # dataframe with the methods applied

    for column in df.columns:

        # arrange columns as: variable + variable_methods
        df_feat_eng = pd.concat([df_feat_eng, df[column]], axis=1)
        for suffix in transformers:
            df_feat_eng[f"{column}_{suffix}"] = df[column]

        # apply each method to its own variable_method column;
        # if a method can't be applied, remove that column
        list_methods_worked = []
        for suffix, transformer_class in transformers.items():
            new_column = f"{column}_{suffix}"
            try:
                transformer = transformer_class(variables=[new_column])
                df_feat_eng = transformer.fit_transform(df_feat_eng)
                list_methods_worked.append(new_column)
            except Exception:
                df_feat_eng.drop([new_column], axis=1, inplace=True)

        print(f"* Variable analyzed: {column}")
        print(f"* Methods that worked: {list_methods_worked} \n")

        # plot the original variable followed by each transformation that worked
        for col in [column] + list_methods_worked:
            print(f"* {col}")
            diagnostic_plots(df_feat_eng, col)
            print("\n")

    return df_feat_eng



def diagnostic_plots(df, variable):
    """Plot a histogram with KDE, a normal Q-Q plot and a boxplot for one variable."""

    fig, (ax1, ax2,ax3) = plt.subplots(1, 3, figsize=(20, 6))

    sns.histplot(data=df, x=variable, kde=True,element="step",ax=ax1) 
    stats.probplot(df[variable], dist="norm", plot=ax2)
    sns.boxplot(y=df[variable],ax=ax3)

    # TODO: analysis on outliers
    # TODO: shapiro analysis
    
    plt.show()
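
The outlier and Shapiro analyses noted in the comments of `diagnostic_plots` are not implemented in this notebook. One possible sketch is shown below. The helper `normality_and_outliers` is hypothetical, and the 1.5 * IQR fence is a common convention assumed here, not something this project prescribes.

```python
import numpy as np
from scipy import stats

def normality_and_outliers(values):
    """Return the Shapiro-Wilk p-value and the count of points outside the 1.5 * IQR fences."""
    values = np.asarray(values, dtype=float)

    # Shapiro-Wilk: a small p-value suggests the sample is not normally distributed
    _, p_value = stats.shapiro(values)

    # IQR rule: points beyond 1.5 * IQR from the quartiles are counted as outliers
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_outliers = int(((values < lower) | (values > upper)).sum())

    return p_value, n_outliers

# Synthetic, right-skewed sample (not project data)
rng = np.random.default_rng(0)
skewed = rng.exponential(scale=1.0, size=500)
p_value, n_outliers = normality_and_outliers(skewed)
print(f"Shapiro-Wilk p-value: {p_value:.4g}, IQR outliers: {n_outliers}")
```

Calling it on each column returned by the transformer function would give a numeric complement to the three plots.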

## Template for Feature Engineering (replace with variable(s) name)

* Step 1: Select variable(s) and describe distribution

In [None]:
variables_engineering = []

* Step 2: Select the engineering method(s)

In [None]:
#####

* Step 3: Create a separate dataframe, with your variable(s)

In [None]:
df_engineering = TrainSet[variables_engineering].copy()

* Step 4: Create engineered variable(s) by applying the method(s)

In [None]:
# use custom function
df_engineering = FeatEngineering_NumericalVariableTransformers(df_engineering)

* Step 5: Assess the distributions of the engineered variables and select the most suitable method

* Step 6: If you are satisfied, apply the selected method to the Train and Test set

In [None]:
TrainSet, TestSet = ....

## Numerical Variables

* Step 1: Select variable(s) and describe distribution

In [None]:
variables_engineering = ['MaxTemp', 'Rainfall', 'WindGustSpeed','WindSpeed9am',
                         'WindSpeed3pm', 'Humidity9am', 'Humidity3pm', 'Pressure9am',
                         'Pressure3pm',  'Temp9am', 'Temp3pm',  'Latitude', 'Longitude']

* Step 2: Select the engineering method(s)

In [None]:
#####

* Step 3: Create a separate dataframe, with your variable(s)

In [None]:
df_engineering = TrainSet[variables_engineering].copy()
df_engineering.head(3)

* Step 4: Create engineered variable(s) by applying the method(s), assess the distributions of the engineered variables and select the most suitable method

In [None]:
df_engineering = FeatEngineering_NumericalVariableTransformers(df=df_engineering)
df_engineering

* The most suitable method is:  MaxTemp_pt

* Step 5: If you are satisfied, apply the selected method to the Train and Test set

In [None]:
TrainSet, TestSet = ....
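
When applying a selected method to both sets, a common practice is to learn any parameters on the Train set only and reuse them on the Test set. The sketch below illustrates that idea with SciPy's Box-Cox on synthetic data. It is not the notebook's implementation: `train_demo`, `test_demo` and the column `var` are hypothetical stand-ins, and SciPy is swapped in for feature-engine.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic stand-ins for the real sets; "var" is a hypothetical positive, skewed column
train_demo = pd.DataFrame({"var": rng.lognormal(size=800)})
test_demo = pd.DataFrame({"var": rng.lognormal(size=200)})

# Learn the Box-Cox lambda on the Train set only
train_transformed, fitted_lambda = stats.boxcox(train_demo["var"].to_numpy())
train_demo["var"] = train_transformed

# Reuse the same lambda on the Test set, so no information flows from Test to Train
test_demo["var"] = stats.boxcox(test_demo["var"].to_numpy(), lmbda=fitted_lambda)

print(f"lambda learned on train: {fitted_lambda:.3f}")
print(test_demo.head(3))
```

With feature-engine or sklearn transformers, the same pattern is usually expressed as `fit` on the Train set followed by `transform` on both sets.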

# Save feature engineered data: Train/Test sets 

In [None]:
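# Uncomment to save. As written, these paths overwrite the cleaned input files,
# while the Outputs section lists inputs/datasets/feat_eng as the destination.
# TrainSet.to_csv("/content/WalkthroughProject/inputs/datasets/cleaned/TrainSetCleaned.csv",index=False)
# TestSet.to_csv("/content/WalkthroughProject/inputs/datasets/cleaned/TestSetCleaned.csv",index=False)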

* You may now go to "Push generated/new files from this session to GitHub Repo" section and push these files to the repo