<a href="https://colab.research.google.com/github/FernandoRocha88/WalkthroughProject/blob/main/jupyter_notebooks/03-%20FeatureEngineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Feature Engineering Notebook

## Objectives

*   Engineer features for classification, regression and clustering models


## Inputs

* inputs/datasets/cleaned/TrainSetCleaned.csv
* inputs/datasets/cleaned/TestSetCleaned.csv

## Outputs

* Generate Train and Test sets with engineered features, both saved under inputs/datasets/feat_eng

## Additional Comments | Insights | Conclusions



* Feature Engineering
  * xxx


---

# Install Packages

In [None]:
! pip install matplotlib -U
! pip install pandas-profiling==2.11.0
! pip install feature-engine==1.0.2
! pip install ppscore==1.2.0

* Code for restarting the runtime. This restarts the Colab session, so all your variables will be lost.

In [None]:
import os
os.kill(os.getpid(), 9)

# Setup GPU

* Go to Edit → Notebook Settings
* In the Hardware accelerator menu, select GPU
* Note: selecting a different option (GPU, TPU or None) switches you to a different kernel/session

---
* How do I know if I am using the GPU?
  * Run the code below. If the output is an empty string, no GPU is available; any other output means you are using a GPU in this session
  * Typically the output will be /device:GPU:0


In [None]:
import tensorflow as tf
tf.test.gpu_device_name()

# **Connection between: Colab Session and your GitHub Repo**

### Insert your **credentials**

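* The variables' content will exist only while the session exists. Once this session terminates, the content is erased permanently.
* GitHub no longer accepts account passwords for Git operations over HTTPS, so enter a personal access token when prompted for the password.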

In [None]:
from getpass import getpass
import os
from IPython.display import clear_output 
print("=== Insert your credentials === \nType in and hit Enter")
UserName = getpass('GitHub User Name: ')
UserEmail = getpass('GitHub User E-mail: ')
RepoName = getpass('GitHub Repository Name: ')
UserPwd = getpass('GitHub Account Password: ')
clear_output()
print("* Thanks for inserting your credentials!")
print(f"* You may now Clone your Repo to this Session, "
      f"then Connect this Session to your Repo.")

---

### **Clone** your GitHub Repo to your current Colab session

* So you can have access to your project's files

In [None]:
! git clone https://github.com/{UserName}/{RepoName}.git
! rm -rf sample_data   # remove the content/sample_data folder, since we don't need it for this project

print("\n")
%cd /content/{RepoName}
print(f"\n\n* Current session directory is:  {os.getcwd()}")
print(f"* You may refresh the session folder to access {RepoName} folder.")

---

### **Connect** this Colab session to your GitHub Repo

* So that, if needed, you can push files generated in this session to your Repo.

In [None]:
!git config --global user.email {UserEmail}
!git config --global user.name {UserName}
!git remote rm origin
!git remote add origin https://{UserName}:{UserPwd}@github.com/{UserName}/{RepoName}.git

print(f"\n\n * The current Colab Session is connected to the following GitHub repo: {UserName}/{RepoName}")
print(" * You can now push new files to the repo.")

---

### **Push** generated/new files from this Session to GitHub repo

* Git commit

In [None]:
CommitMsg = "added-cleaned-data"
!git add .
!git commit -m {CommitMsg}

* Git Push

In [None]:
!git push origin main


---

### **Delete** Cloned Repo from current Session

In [None]:
%cd /content
!rm -rf {RepoName}
print(f"\n * Please refresh session folder to validate that {RepoName} folder was removed from this session.")

---

# Load your data

In [None]:
import pandas as pd
train_set_path = "/content/WalkthroughProject/inputs/datasets/cleaned/TrainSetCleaned.csv"
TrainSet = pd.read_csv(train_set_path)
TrainSet.info()

In [None]:
test_set_path = '/content/WalkthroughProject/inputs/datasets/cleaned/TestSetCleaned.csv'
TestSet = pd.read_csv(test_set_path)
TestSet.info()

## Quick exploration with Pandas Profiling

In [None]:
from pandas_profiling import ProfileReport
pandas_report = ProfileReport(
                            df=TrainSet,
                            minimal=True) # set to False for the full (slower) report
pandas_report.to_notebook_iframe()

# Correlation and PPS

## Correlation

* Which variables are most correlated with a given set of variables?

In [None]:
df_corr_spearman = TrainSet.corr(method="spearman")
df_corr_pearson = TrainSet.corr(method="pearson")

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

def heatmap_correlation(df_corr,CorrThreshold):
  NumberOfColumns = len(df_corr.columns)

  if NumberOfColumns > 1:
      # hide the upper triangle and every cell below the correlation threshold
      mask = np.zeros_like(df_corr, dtype=bool)
      mask[np.triu_indices_from(mask)] = True
      mask[abs(df_corr) < CorrThreshold] = True

      fig, ax = plt.subplots(figsize=(20,10))
      ax = sns.heatmap(data=df_corr,annot=True,
                       xticklabels=True,yticklabels=True,
                       mask=mask,cmap='viridis',annot_kws={"size": 8})
      plt.ylim(NumberOfColumns,0)
      plt.show()


In [None]:
print("Correlation Heatmap - Spearman: evaluates the monotonic relationship \n")
heatmap_correlation(df_corr=df_corr_spearman, CorrThreshold=0.6)

In [None]:
print("Correlation Heatmap - Pearson: evaluates the linear relationship between two continuous variables \n")
heatmap_correlation(df_corr=df_corr_pearson,CorrThreshold=0.6)
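
The difference between the two methods can be seen on a toy example. The sketch below is not part of the project data: `df_toy` is a hypothetical DataFrame where `y` grows exponentially with `x`, so the relationship is perfectly monotonic but not linear.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: y increases with x, but not along a straight line
x = np.arange(1, 21, dtype=float)
df_toy = pd.DataFrame({"x": x, "y": np.exp(x / 2)})

# Spearman works on ranks, so a perfectly monotonic relationship scores 1
spearman = df_toy.corr(method="spearman").loc["x", "y"]
# Pearson measures linear association, so the curvature lowers the score
pearson = df_toy.corr(method="pearson").loc["x", "y"]

print(f"Spearman: {spearman:.3f}")
print(f"Pearson:  {pearson:.3f}")
```

When the two heatmaps disagree for a pair of variables, a non-linear but monotonic relationship like this one is a likely explanation.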

## PPS

In [None]:
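import ppscore as pps

# compute PPS on a 2% sample of the Train set to keep the run time manageable
ppsMatrixRaw = pps.matrix(TrainSet.sample(frac=0.02))
ppsMatrixRaw

In [None]:
# reshape the long-format result into a matrix: one column per feature x, one row per target y
FullMatrix = (ppsMatrixRaw
    [['x', 'y', 'ppscore']]
    .pivot(columns='x', index='y', values='ppscore')
    )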

In [None]:
def heatmap_pps(df,PPS_Threshold):
    import matplotlib.pyplot as plt
    import seaborn as sns
    import numpy as np
    if len(df.columns) > 1:

        mask = np.zeros_like(df, dtype=bool)
        mask[abs(df) < PPS_Threshold] = True

        fig, ax = plt.subplots(figsize=(20,8))
        ax = sns.heatmap(
            df, 
            annot=True,
            xticklabels=True,
            yticklabels=True,
            mask=mask,
            cmap='Blues',
            annot_kws={"size": 7})
        
        plt.ylim(len(df.columns),0)
        plt.show()

heatmap_pps(df=FullMatrix,PPS_Threshold=0.1)

* pps heatmap with target

In [None]:
def heatmap_pps_target(df,NumberOfColumns):
  import matplotlib.pyplot as plt
  import seaborn as sns
  import numpy as np
  fig, ax = plt.subplots(figsize=(20,8))
  ax = sns.heatmap(
          df,
          xticklabels=True,
          yticklabels=True,
          annot=True,
          cmap='coolwarm',
          annot_kws={"size": 8})

  plt.ylim(NumberOfColumns,0)
  plt.show()

heatmap_pps_target(df=FullMatrix,NumberOfColumns=FullMatrix.shape[1])

# Feature Engineering

* At this stage, there is no missing data in your Train and Test sets.
* Now you are looking to engineer (that is, transform) your variables, so that the Machine Learning model can better learn the relationships among the variables (features and labels).
  * It is important to run a quick EDA to assess the shape of each variable's distribution. Many Machine Learning models tend to learn better when the distribution is close to normal. To engineer that, you can use transformers from packages like feature-engine or sklearn.
  * You can also use your business acumen and technical expertise to create new variables. For example, imagine your dataset describes the operations of your orange juice company. There is a variable called "revenue" and another called "volume"; dividing revenue by volume tells you how much money you make per liter of manufactured juice.
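
The orange juice example can be sketched in a few lines of pandas. The DataFrame `df_juice`, its values and the new column name `revenue_per_liter` are all hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical toy data for the orange juice example; the values are made up
df_juice = pd.DataFrame({
    "revenue": [1000.0, 1500.0, 900.0],  # money earned
    "volume": [500.0, 600.0, 450.0],     # liters of juice manufactured
})

# New variable: money made per liter of manufactured juice
df_juice["revenue_per_liter"] = df_juice["revenue"] / df_juice["volume"]
print(df_juice)
```

A ratio like this can carry information that neither original variable carries on its own.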

---   

In [None]:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
num_variables = TrainSet.select_dtypes(include=numerics).columns
num_variables

In [None]:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
cat_variables = TrainSet.select_dtypes(exclude=numerics ).columns
cat_variables

* **Strategy**


* **1 - List all variables you are initially interested in engineering, grouped into numerical variables first, then categorical variables**

* Numerical
  * MaxTemp
  * Rainfall
  * WindGustSpeed
  * WindSpeed9am
  * WindSpeed3pm
  * Humidity9am
  * Humidity3pm
  * Temp9am
  * Temp3pm
  * Latitude
  * Longitude
  * RainfallTomorrow  (target variable for Reg model)

* Categorical
  * WindGustDir
  * WindDir9am
  * WindDir3pm
  * Cloud9am
  * State
  * Location




* **2 - Consider the following template to help your engineering process**

  * 1 - Select variable(s) and describe distribution
  * 2 - Select the engineering method(s)
  * 3 - Create a separate dataframe, with your variable(s)
  * 4 - Create engineered variable(s) applying the method(s)
  * 5 - Assess engineered variables distribution and select most suitable method
  * 6 - If you are satisfied, apply the selected method to the Train and Test set


## Custom functions for engineering numerical variables

In [None]:
from feature_engine import transformation as vt
from feature_engine.discretisation import EqualFrequencyDiscretiser, EqualWidthDiscretiser
from feature_engine.outliers import Winsorizer
from feature_engine.encoding import CountFrequencyEncoder
from feature_engine.encoding import RareLabelEncoder


import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
sns.set(style="darkgrid")
import warnings
warnings.filterwarnings('ignore')


def FeatureEngineering(df,analysis_type=None):
    """
    - Used for quick feature engineering on numerical variables, to decide
    which method best transforms the distribution shape so that it
    looks more Gaussian.
    - Once transformed, use a reporting tool, like sweetviz, to evaluate distributions

    - Transformers applied include: LogTransformer, ReciprocalTransformer,
    PowerTransformer, BoxCoxTransformer, YeoJohnsonTransformer

    """


    ### Check analysis type
    # 'outlier' is not listed because no transformation branch handles it below
    allowed_types = ['numerical', 'discretization', 'countfrequency', 'outlier_winsorizer']
    if analysis_type is None:
      raise SystemExit(f"You should pass analysis_type argument: {allowed_types}")
    if analysis_type not in allowed_types:
      raise SystemExit(f"analysis_type argument should be one of {allowed_types}")


    ### Set column suffixes according to analysis_type
    if analysis_type=='numerical':
      list_column_methods = ["lte","lt10","rt", "pt","bct","yj"]
    
    elif analysis_type=='outlier':
      list_column_methods = ["lt","rt", "pt","bct","yj"]
    
    elif analysis_type=='discretization':
      list_column_methods = ['equal_frequency_5intervals',
                             'equal_frequency_10intervals',
                             'equal_width_5intervals',
                             'equal_width_10intervals']
    
    elif analysis_type=='countfrequency':
      list_column_methods = ["count_encoder","frequency_encoder"]

    elif analysis_type=='outlier_winsorizer':
      list_column_methods = ['gaussian', 'iqr']



    df_feat_eng = pd.DataFrame([]) # empty engineered dataframe
    for column in df.columns:

      ### create additional columns (column_method) to apply the methods
      df_feat_eng = pd.concat([df_feat_eng, df[column]], axis=1)
      for method in list_column_methods:
        df_feat_eng[f"{column}_{method}"] = df[column]
        
      ### Apply methods in respectives column_method
      if analysis_type=='numerical':
        df_feat_eng,list_applied_methods = FeatEngineering_Numerical(df_feat_eng,column)
      
      elif analysis_type=='discretization':
        df_feat_eng,list_applied_methods = FeatEngineering_Discretization(df_feat_eng,column)
      
      elif analysis_type=='outlier_winsorizer':
        df_feat_eng,list_applied_methods = FeatEngineering_OutlierWinsorizer(df_feat_eng,column)

      elif analysis_type=='countfrequency':
        df_feat_eng,list_applied_methods = FeatEngineering_CountFrequency(df_feat_eng,column)

      
      # For each variable, assess how the methods perform
      print(f"* Variable Analyzed: {column}")
      print(f"* Applied Methods: {list_applied_methods} \n")
      for col in [column] + list_applied_methods:
        
        if analysis_type!='countfrequency':
          DiagnosticPlots_Numerical(df_feat_eng, col)
        
        else:
          if col == column: 
            DiagnosticPlots_Categories(df_feat_eng, col)
          else:
            DiagnosticPlots_Numerical(df_feat_eng, col)

        print("\n")


    return df_feat_eng


def DiagnosticPlots_Categories(df_feat_eng, col):
  plt.figure(figsize=(20, 5))
  sns.countplot(data=df_feat_eng, x=col,palette=['#432371'])
  plt.xticks(rotation=90) 
  plt.suptitle(f"{col}", fontsize=30,y=1.05)        
  plt.show();
  print("\n")



def DiagnosticPlots_Numerical(df, variable):

    fig, (ax1, ax2,ax3) = plt.subplots(1, 3, figsize=(20, 6))

    sns.histplot(data=df, x=variable, kde=True,element="step",ax=ax1) 
    stats.probplot(df[variable], dist="norm", plot=ax2)
    sns.boxplot(x=df[variable],ax=ax3)
    
    # analysis on outliers
    # shapiro analysis
    ax1.set_title('Histogram')
    ax2.set_title('Probability Plot')
    ax3.set_title('Boxplot')
    fig.suptitle(f"{variable}", fontsize=30,y=1.05)
    plt.show();


def FeatEngineering_CountFrequency(df_feat_eng,column):
  list_methods_worked = []
  ###  CountEncoder
  try: 
    encoder= CountFrequencyEncoder(encoding_method='count',variables = [f"{column}_count_encoder"])
    df_feat_eng = encoder.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_count_encoder")
  except: 
    df_feat_eng.drop([f"{column}_count_encoder"],axis=1,inplace=True)

  ###  FrequencyEncoder
  try: 
    encoder= CountFrequencyEncoder(encoding_method='frequency',variables = [f"{column}_frequency_encoder"])
    df_feat_eng = encoder.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_frequency_encoder")
  except: 
    df_feat_eng.drop([f"{column}_frequency_encoder"],axis=1,inplace=True)

  return df_feat_eng,list_methods_worked


def FeatEngineering_OutlierWinsorizer(df_feat_eng,column):

  list_methods_worked = []

  ### Winsorizer gaussian
  try: 
    disc=Winsorizer(
        capping_method='gaussian', tail='both', fold=3,variables = [f"{column}_gaussian"])
    df_feat_eng = disc.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_gaussian")
  except: 
    df_feat_eng.drop([f"{column}_gaussian"],axis=1,inplace=True)

  ### Winsorizer iqr
  try: 
    disc=Winsorizer(
        capping_method='iqr', tail='both', fold=3,variables = [f"{column}_iqr"])
    df_feat_eng = disc.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_iqr")
  except: 
    df_feat_eng.drop([f"{column}_iqr"],axis=1,inplace=True)


  return df_feat_eng,list_methods_worked






def FeatEngineering_Discretization(df_feat_eng,column):

  list_methods_worked = []

  ### EqualFrequencyDiscretiser
  try: 
    disc= EqualFrequencyDiscretiser(q=5,variables = [f"{column}_equal_frequency_5intervals"])
    df_feat_eng = disc.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_equal_frequency_5intervals")
  except: 
    df_feat_eng.drop([f"{column}_equal_frequency_5intervals"],axis=1,inplace=True)


  ### EqualFrequencyDiscretiser
  try: 
    disc= EqualFrequencyDiscretiser(q=10,variables = [f"{column}_equal_frequency_10intervals"])
    df_feat_eng = disc.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_equal_frequency_10intervals")
  except: 
    df_feat_eng.drop([f"{column}_equal_frequency_10intervals"],axis=1,inplace=True)


  ### EqualWidthDiscretiser
  try: 
    disc= EqualWidthDiscretiser(bins=5,variables = [f"{column}_equal_width_5intervals"])
    df_feat_eng = disc.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_equal_width_5intervals")
  except: 
    df_feat_eng.drop([f"{column}_equal_width_5intervals"],axis=1,inplace=True)


  ### EqualWidthDiscretiser
  try: 
    disc= EqualWidthDiscretiser(bins=10,variables = [f"{column}_equal_width_10intervals"])
    df_feat_eng = disc.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_equal_width_10intervals")
  except: 
    df_feat_eng.drop([f"{column}_equal_width_10intervals"],axis=1,inplace=True)


  return df_feat_eng,list_methods_worked
  
  

def FeatEngineering_Numerical(df_feat_eng,column):

  list_methods_worked = []

  ### LogTransformer base e
  try: 
    lt = vt.LogTransformer(variables = [f"{column}_lte"])
    df_feat_eng = lt.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_lte")
  except: 
    df_feat_eng.drop([f"{column}_lte"],axis=1,inplace=True)

    ### LogTransformer base 10
  try: 
    lt = vt.LogTransformer(variables = [f"{column}_lt10"],base='10')
    df_feat_eng = lt.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_lt10")
  except: 
    df_feat_eng.drop([f"{column}_lt10"],axis=1,inplace=True)

  ### ReciprocalTransformer
  try:
    rt = vt.ReciprocalTransformer(variables = [f"{column}_rt"])
    df_feat_eng =  rt.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_rt")
  except:
    df_feat_eng.drop([f"{column}_rt"],axis=1,inplace=True)

  ### PowerTransformer
  try:
    pt = vt.PowerTransformer(variables = [f"{column}_pt"])
    df_feat_eng = pt.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_pt")
  except:
    df_feat_eng.drop([f"{column}_pt"],axis=1,inplace=True)

  ### BoxCoxTransformer
  try:
    bct = vt.BoxCoxTransformer(variables = [f"{column}_bct"])
    df_feat_eng = bct.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_bct")
  except:
    df_feat_eng.drop([f"{column}_bct"],axis=1,inplace=True)


  ### YeoJohnsonTransformer
  try:
    yjt = vt.YeoJohnsonTransformer(variables = [f"{column}_yj"])
    df_feat_eng = yjt.fit_transform(df_feat_eng)
    list_methods_worked.append(f"{column}_yj")
  except:
    df_feat_eng.drop([f"{column}_yj"],axis=1,inplace=True)


  return df_feat_eng,list_methods_worked




## Template for Feature Engineering

* Step 1: Select variable(s) and describe distribution

In [None]:
variables_engineering = []

* Step 2: Select the engineering method(s)

In [None]:
#####

* Step 3: Create a separate dataframe, with your variable(s)

In [None]:
df_engineering = TrainSet[variables_engineering].copy()

* Step 4: Create engineered variable(s) applying the method(s), assess the engineered variables' distribution and select the most suitable method for each variable

In [None]:
# use custom function
df_engineering = ....

* For each variable, write your conclusion on how effective the method(s) appear(s) to be
  * xxx
  * xxxx

* Step 5: If you are satisfied, apply the selected method to the Train and Test set


In [None]:
TrainSet, TestSet = ....

## Numerical Variables

* Step 1: Select variable(s) and describe distribution

In [None]:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
variables_engineering = TrainSet.select_dtypes(include=numerics).columns  # all numerical variables
# for this walkthrough, narrow the selection to a few variables
variables_engineering = ['MinTemp', 'MaxTemp', 'WindGustSpeed', 'WindSpeed9am']

* Step 2: Select the engineering method(s)

In [None]:
from feature_engine.transformation import (LogTransformer,
                                           ReciprocalTransformer,
                                           PowerTransformer,
                                           BoxCoxTransformer,
                                           YeoJohnsonTransformer)


* Step 3: Create a separate dataframe, with your variable(s)

In [None]:
df_engineering = TrainSet[variables_engineering].copy()
df_engineering.head(3)

* Step 4: Create engineered variable(s) applying the method(s), assess the engineered variables' distribution and select the most suitable method for each variable

In [None]:
df_engineering = FeatureEngineering(df=df_engineering,analysis_type='numerical')
df_engineering

* For each variable, write your conclusion on how effective the method(s) appear(s) to be
  * MinTemp - yj
  * MaxTemp - not applying any method
  * WindGustSpeed - bct
  * WindSpeed9am - yj



* Step 5: If you are satisfied, apply the selected method to the Train and Test set

In [None]:
from feature_engine.transformation import (LogTransformer,ReciprocalTransformer,
                                           PowerTransformer,BoxCoxTransformer,
                                           YeoJohnsonTransformer)
# the steps are: 
# 1 - select given method and respective variable(s)
# 2 - create transformer
# 3 - fit_transform into TrainSet
# 4 - transform into TestSet

variable_bct = ['WindGustSpeed']
bct = BoxCoxTransformer(variables = variable_bct)
TrainSet = bct.fit_transform(TrainSet)
TestSet = bct.transform(TestSet)

variable_yj = ['MinTemp','WindSpeed9am']
yj = YeoJohnsonTransformer(variables=variable_yj)
TrainSet = yj.fit_transform(TrainSet)
TestSet = yj.transform(TestSet)



# lte = LogTransformer(variables = )
# lt10 = LogTransformer(base='10', variables =)
# rt = ReciprocalTransformer(variables =)
# pt = PowerTransformer(variables = )
# bct = BoxCoxTransformer(variables = )
# yjt = YeoJohnsonTransformer(variables = )
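
One practical difference between the two power transforms chosen above is that Box-Cox requires strictly positive values, while Yeo-Johnson also accepts zeros and negatives. The sketch below illustrates this with `scipy.stats` directly rather than feature-engine, on hypothetical log-normal data (`skewed`); it is an illustration under those assumptions, not the notebook's pipeline.

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed, strictly positive data
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# Both functions estimate lambda and return (transformed values, lambda)
bc_values, bc_lambda = stats.boxcox(skewed)
yj_values, yj_lambda = stats.yeojohnson(skewed)

print("skew before:     ", stats.skew(skewed))
print("skew Box-Cox:    ", stats.skew(bc_values))
print("skew Yeo-Johnson:", stats.skew(yj_values))

# Adding a single zero is enough for Box-Cox to reject the data
with_zero = np.append(skewed, 0.0)
try:
    stats.boxcox(with_zero)
    boxcox_accepts_zero = True
except ValueError:
    boxcox_accepts_zero = False

# Yeo-Johnson still works on the same data
yj_with_zero, _ = stats.yeojohnson(with_zero)
print("Box-Cox accepts zero:", boxcox_accepts_zero)
```

This is one reason a variable that can be zero or negative, such as a temperature, may only be a candidate for Yeo-Johnson.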



## Variable Discretisation

* Step 1: Select variable(s) and describe distribution

In [None]:
variables_engineering= ['Latitude', 'Longitude','Rainfall']
variables_engineering

* Step 2: Select the engineering method(s)

In [None]:
from feature_engine.discretisation import EqualWidthDiscretiser
from feature_engine.discretisation import EqualFrequencyDiscretiser

* Step 3: Create a separate dataframe, with your variable(s)

In [None]:
df_engineering = TrainSet[variables_engineering].copy()
df_engineering.head(3)

* Step 4: Create engineered variable(s) applying the method(s), assess the engineered variables' distribution and select the most suitable method

In [None]:
df_engineering = FeatureEngineering(df=df_engineering,analysis_type='discretization')
df_engineering

* For each variable, write your conclusion on how effective the method(s) appear(s) to be
  * 'Latitude' - equal frequency 5 intervals
  * 'Longitude' - equal frequency 10 intervals
  * 'Rainfall' - no method applied

* Step 5: If you are satisfied, apply the selected method to the Train and Test set


In [None]:
# just reinforcing the methods we will use
from feature_engine.discretisation import EqualWidthDiscretiser
from feature_engine.discretisation import EqualFrequencyDiscretiser


# the steps are: 
# 1 - select given method and respective variable(s)
# 2 - create transformer
# 3 - fit_transform into TrainSet
# 4 - transform into TestSet



variable_equal_freq5 = ['Latitude' ]
disc = EqualFrequencyDiscretiser(q=5,variables = variable_equal_freq5)
TrainSet = disc.fit_transform(TrainSet)
TestSet = disc.transform(TestSet)


variable_equal_freq10 = ['Longitude' ]
disc = EqualFrequencyDiscretiser(q=10,variables = variable_equal_freq10)
TrainSet = disc.fit_transform(TrainSet)
TestSet = disc.transform(TestSet)


# EqualFrequencyDiscretiser: 'q' argument is the number of intervals
# EqualWidthDiscretiser: 'bins' argument is the number of intervals

# disc= EqualWidthDiscretiser(bins=????,variables = )
# disc = EqualFrequencyDiscretiser(q=???? ,variables = )
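
To see how the two strategies differ, the sketch below applies the same ideas with plain pandas (`pd.cut` for equal width, `pd.qcut` for equal frequency) instead of feature-engine. The series `values` is hypothetical and contains one extreme value on purpose.

```python
import pandas as pd

# Hypothetical data with one extreme value
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Equal width: 5 intervals of the same size across the range 1..100
equal_width = pd.cut(values, bins=5, labels=False)

# Equal frequency: 5 intervals holding the same number of observations
equal_freq = pd.qcut(values, q=5, labels=False)

print("Equal width counts per interval:")
print(equal_width.value_counts().sort_index())
print("Equal frequency counts per interval:")
print(equal_freq.value_counts().sort_index())
```

With equal width, the extreme value stretches the range so that almost every observation falls in the first interval; equal frequency keeps the intervals balanced. That trade-off is worth checking in the diagnostic plots before choosing a method.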


## Categorical Encoding - RareLabel

* Step 1: Select variable and describe distribution

In [None]:
variables_engineering = ['Cloud9am', 'State']
variables_engineering

* Step 2: Select the engineering method(s)

In [None]:
from feature_engine.encoding import RareLabelEncoder

* Step 3: Create engineered variable(s) applying the method(s), assess the engineered variables' distribution and select the most suitable method

In [None]:
from feature_engine.encoding import RareLabelEncoder
# broader grid of tolerances you may explore: [0.01, 0.03, 0.05, 0.08, 0.1, 0.12]
list_tol = [0.07,0.08]


for variable_engi in variables_engineering:
  df_engineering = TrainSet[[variable_engi]].copy()

  for tol in list_tol:
    df_engineering[f"{variable_engi}_rare_tol {str(tol)}"] = df_engineering[variable_engi]
    
    
    encoder = RareLabelEncoder(tol=tol, n_categories=2, variables=[f"{variable_engi}_rare_tol {str(tol)}"],
                              replace_with='Rare')
    df_engineering = encoder.fit_transform(df_engineering)
  

  for col in df_engineering.columns: DiagnosticPlots_Categories(df_engineering, col)

* For each variable, write your conclusion on how effective the method(s) appear(s) to be
  * Cloud9am - 4% tol
  * State - 7% tol

* Step 4: If you are satisfied, apply the selected method to the Train and Test set


In [None]:
# just reinforcing the encoder we will use
from feature_engine.encoding import RareLabelEncoder

# the steps are: 
# 1 - select given tolerance and respective variable(s)
# 2 - create transformer
# 3 - fit_transform into TrainSet
# 4 - transform into TestSet

variable_rare= ['Cloud9am']
encoder = RareLabelEncoder(tol=0.04, n_categories=2, variables=variable_rare)
TrainSet = encoder.fit_transform(TrainSet)
TestSet = encoder.transform(TestSet)

variable_rare= ['State']
encoder = RareLabelEncoder(tol=0.08, n_categories=2, variables=variable_rare)
TrainSet = encoder.fit_transform(TrainSet)
TestSet = encoder.transform(TestSet)
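
The idea behind rare-label grouping can be sketched with plain pandas: categories whose share in the Train set falls below the tolerance are replaced with 'Rare', and the same list of frequent categories is then reused on the Test set. The data (`train_toy`, `test_toy`) and the tolerance below are hypothetical, and the exact rules inside `RareLabelEncoder` may differ from this simplified version.

```python
import pandas as pd

# Hypothetical categorical data
train_toy = pd.Series(["NSW"] * 50 + ["VIC"] * 40 + ["TAS"] * 7 + ["NT"] * 3)
test_toy = pd.Series(["NSW", "NT", "TAS", "VIC"])
tol = 0.05  # hypothetical tolerance: minimum share for a category to be kept

# Learn the frequent categories from the Train set only
freq = train_toy.value_counts(normalize=True)
frequent = freq[freq >= tol].index

# Replace every other category with 'Rare', in Train and in Test
train_enc = train_toy.where(train_toy.isin(frequent), "Rare")
test_enc = test_toy.where(test_toy.isin(frequent), "Rare")

print(train_enc.value_counts())
print(test_enc.tolist())
```

Raising the tolerance groups more categories together, which is why the plots for several tolerances are compared before settling on one.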



## Categorical Encoding - Count Frequency

* Step 1: Select variable(s) and describe distribution

In [None]:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
variables_engineering = TrainSet.select_dtypes(exclude=numerics ).columns
variables_engineering

* Step 2: Select the engineering method(s)

In [None]:
from feature_engine.encoding import CountFrequencyEncoder

* Step 3: Create a separate dataframe, with your variable(s)

In [None]:
df_engineering = TrainSet[variables_engineering].copy()
df_engineering.head(3)

* Step 4: Create engineered variable(s) applying the method(s), assess the engineered variables' distribution and select the most suitable method for each variable

In [None]:
df_engineering = FeatureEngineering(df=df_engineering,analysis_type='countfrequency')
df_engineering

* For each variable, write your conclusion on how effective the method(s) appear(s) to be
  * 'Location'
  * 'WindGustDir'
  * 'WindDir9am'
  * 'WindDir3pm'
  * 'Cloud9am'
  * 'RainToday'
  * 'RainTomorrow'
  * 'State'



* Step 5: If you are satisfied, apply the selected method to the Train and Test set

In [None]:
from feature_engine.encoding import CountFrequencyEncoder

# the steps are: 
# 1 - select given method and respective variable(s)
# 2 - create transformer
# 3 - fit_transform into TrainSet
# 4 - transform into TestSet

variable_count = variables_engineering.to_list()
encoder = CountFrequencyEncoder(encoding_method='count',variables = variable_count)
TrainSet = encoder.fit_transform(TrainSet)
TestSet = encoder.transform(TestSet)
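
Count and frequency encoding replace each category with how often it occurs in the Train set. The sketch below reproduces the idea with plain pandas on hypothetical data (`train_toy`, `test_toy`, column `WindDir`); it is a simplified illustration, not the feature-engine implementation.

```python
import pandas as pd

# Hypothetical categorical data
train_toy = pd.DataFrame({"WindDir": ["N", "N", "S", "E", "N", "S"]})
test_toy = pd.DataFrame({"WindDir": ["S", "N", "W"]})

# Learn the mappings from the Train set only
counts = train_toy["WindDir"].value_counts()               # N: 3, S: 2, E: 1
freqs = train_toy["WindDir"].value_counts(normalize=True)  # N: 0.5, ...

# Apply the learned mappings to both sets
train_toy["WindDir_count"] = train_toy["WindDir"].map(counts)
test_toy["WindDir_count"] = test_toy["WindDir"].map(counts)
test_toy["WindDir_freq"] = test_toy["WindDir"].map(freqs)

print(train_toy)
print(test_toy)
```

In this sketch a category that appears only in the Test set ('W') has no learned value and becomes NaN, so it is worth checking the Test set for unseen categories after encoding.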






## Handle Outliers (Winsorizer: caps maximum and/or minimum values)

* Step 1: Select variable(s) and describe distribution

* **Quick reminder: The variable(s) must be numerical**

In [None]:
TrainSet.columns

In [None]:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
variables_engineering = TrainSet.select_dtypes(include=numerics ).columns
variables_engineering

* Step 2: Select the engineering method(s)

In [None]:
from feature_engine.outliers import Winsorizer

* Step 3: Create a separate dataframe, with your variable(s)

In [None]:
df_engineering = TrainSet[variables_engineering].copy()
df_engineering.head(3)

* Step 4: Create engineered variable(s) applying the method(s), assess the engineered variables' distribution and select the most suitable method

In [None]:
df_engineering = FeatureEngineering(df=df_engineering,analysis_type='outlier_winsorizer')
df_engineering

* For each variable, write your conclusion on how effective the method(s) appear(s) to be
  * xxx
  * xxxx

* Step 5: If you are satisfied, apply the selected method to the Train and Test set


In [None]:
TrainSet, TestSet = ....
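
Capping with the IQR rule can be sketched with plain pandas: the limits are computed on the Train set and then used to clip both sets. The data below is hypothetical, and `fold = 3` simply mirrors the value used in the custom function above; this is a simplified illustration rather than the Winsorizer implementation.

```python
import pandas as pd

# Hypothetical numerical data with one extreme value in each set
train_toy = pd.Series([10, 12, 11, 13, 12, 11, 95], dtype=float)
test_toy = pd.Series([9, 12, 200], dtype=float)
fold = 3  # how many IQRs beyond the quartiles a value may lie

# Learn the capping limits from the Train set only
q1, q3 = train_toy.quantile(0.25), train_toy.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - fold * iqr, q3 + fold * iqr

# Cap values outside the limits; no observation is removed
train_capped = train_toy.clip(lower=lower, upper=upper)
test_capped = test_toy.clip(lower=lower, upper=upper)

print(f"limits: [{lower}, {upper}]")
print(train_capped.tolist())
print(test_capped.tolist())
```

The number of rows is unchanged; only the extreme values are pulled back to the limits.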

* create a section for it

## Handle Outliers (OutlierTrimmer: removes observations with outliers)

* Step 1: Select variable(s) and describe distribution

* **Quick reminder: The variable(s) must be numerical**

In [None]:
TrainSet.columns

In [None]:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
variables_engineering = TrainSet.select_dtypes(include=numerics ).columns
variables_engineering

* Step 2: Select the engineering method(s)

In [None]:
from feature_engine.outliers import OutlierTrimmer

* Step 3: Create a separate dataframe, with your variable(s)

In [None]:
df_engineering = TrainSet[variables_engineering].copy()
df_engineering.head(3)

* Step 4: Create engineered variable(s) applying the method(s), assess the engineered variables' distribution and select the most suitable method

In [None]:
df_engineering = FeatureEngineering(df=df_engineering,analysis_type='outlier_trimmer')
df_engineering

* For each variable, write your conclusion on how effective the method(s) appear(s) to be
  * xxx
  * xxxx

* Step 5: If you are satisfied, apply the selected method to the Train and Test set


In [None]:
TrainSet, TestSet = ....
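
Unlike capping, trimming drops the whole observation when a value is an outlier. A minimal pandas sketch of that idea follows, using hypothetical data (`df_toy`) and an IQR rule with `fold = 3`; both choices are assumptions for illustration, not settings taken from this project.

```python
import pandas as pd

# Hypothetical data: one row has an extreme Rainfall value
df_toy = pd.DataFrame({
    "Rainfall": [0.0, 0.2, 0.0, 1.0, 0.4, 50.0],
    "MaxTemp": [21, 23, 22, 24, 22, 23],
})
fold = 3  # how many IQRs beyond the quartiles a value may lie

# Compute the limits for the variable being assessed
q1, q3 = df_toy["Rainfall"].quantile(0.25), df_toy["Rainfall"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - fold * iqr, q3 + fold * iqr

# Keep only the rows whose Rainfall lies within the limits
df_trimmed = df_toy[df_toy["Rainfall"].between(lower, upper)]

print(f"rows before: {len(df_toy)}, rows after: {len(df_trimmed)}")
```

Because rows are removed, the Train set shrinks, so it is worth checking how many observations are lost before applying this method.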

* create a section for it

* Step 1: Select variable(s) and describe distribution

In [None]:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
variables_engineering = TrainSet.select_dtypes(exclude=numerics ).columns
variables_engineering

* Step 2: Select the engineering method(s)

In [None]:
# from feature_engine.encoding import RareLabelEncoder
from feature_engine.encoding import CountFrequencyEncoder

* Step 3: Create a separate dataframe, with your variable(s)

In [None]:
df_engineering = TrainSet[variables_engineering].copy()
df_engineering.head(3)

* Step 4: Create engineered variable(s) applying the method(s), assess the engineered variables' distribution and select the most suitable method

In [None]:
df_engineering = FeatureEngineering(df=df_engineering,analysis_type='countfrequency')
df_engineering

* For each variable, write your conclusion on how effective the method(s) appear(s) to be
  * xxx
  * xxxx

* Step 5: If you are satisfied, apply the selected method to the Train and Test set


In [None]:
TrainSet, TestSet = ....

* create a section for it

# Save feature engineered data: Train/Test sets 

In [None]:
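# These commented paths point at the cleaned inputs and would overwrite them;
# the Outputs section expects the engineered files under inputs/datasets/feat_eng.
# TrainSet.to_csv("/content/WalkthroughProject/inputs/datasets/cleaned/TrainSetCleaned.csv",index=False)
# TestSet.to_csv("/content/WalkthroughProject/inputs/datasets/cleaned/TestSetCleaned.csv",index=False)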

* You may now go to "Push generated/new files from this session to GitHub Repo" section and push these files to the repo