# Feature Engineering

In the feature engineering step, we create new features or modify existing ones to improve the performance of the machine learning model.

Helpful Links:
https://www.freecodecamp.org/news/feature-engineering-and-feature-selection-for-beginners/

In [1]:
# enables referencing modules in repository
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import statistics as stat 
import pipeline as pipe

from scipy.stats import chi2_contingency
from src.features import build_features
# from src.data import make_dataset 
# commented out because: there seems to be an issue at the moment with the initial method from make_dataset
from src.models import train_model
from src.models import predict_model
from src.visualization import visualize
from tabulate import tabulate
from scipy import stats
from src.pipelines.build_pipelines import CustomPipeline, get_best_steps
from sklearn import set_config

from src.features.build_features import *
from category_encoders.binary import BinaryEncoder
from sklearn.preprocessing import LabelEncoder, RobustScaler
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.tree import DecisionTreeClassifier

The pipeline below uses the best steps we have found so far, that is, the combination that produced the highest evaluation scores.

In [4]:
# Confirm the pipeline is working in the notebook
testpip = CustomPipeline(get_best_steps(), apply_ordinal_encoding=True, force_data_cleaning=False)
testpip.run()

loading data
preparing data
running pipeline
evaluating pipeline
    fit_time: 0.23234968185424804
    score_time: 0.028180503845214845
    test_accuracy: 0.70739972086257
    test_f1-score: 0.6118018211054055
    test_mcc: 0.4365692823656165
storing model and prediction


In [5]:
set_config(display="diagram")
testpip.pipeline

In [6]:
estimator = DecisionTreeClassifier()
baseline_pipe = CustomPipeline(steps=[('estimator', estimator)], skip_evaluation=False)
baseline_pipe.run()

loading data
preparing data
running pipeline
evaluating pipeline
    fit_time: 0.18929929733276368
    score_time: 0.016856908798217773
    test_accuracy: 0.6685463596585042
    test_f1-score: 0.564354039222139
    test_mcc: 0.3647947219811763
storing model and prediction


The model above serves as the baseline for a decision tree classifier.

## Basic - Processes

### Encoding (Patrick)

Encoding transforms categorical data into numerical values that machine learning algorithms can work with. Some of the most commonly used techniques are:

* One-hot encoding
* Label encoding
* Binary encoding
* Count encoding
* Target encoding (Thomas)
* Hashing encoding

One-hot encoding is a technique used to convert categorical data into numerical data by creating a binary vector for each category. For example, if we have a categorical feature called “color” with three categories (red, green, and blue), we can create three binary vectors (one for each category) with a value of 1 for the corresponding category and 0 for the others.

Label encoding is another technique used to convert categorical data into numerical data by assigning a unique integer value to each category. For example, if we have a categorical feature called “color” with three categories (red, green, and blue), we can assign the values 0, 1, and 2 to each category respectively.

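Binary encoding is similar to one-hot encoding but uses fewer features: each category is first mapped to an integer, and the binary digits of that integer are stored in separate columns. Count encoding replaces each category with the number of times it appears in the dataset. Target encoding replaces each category with the mean target value for that category. Hashing encoding applies a hash function to each category and maps it to a vector of fixed length, so the number of output columns does not depend on the number of categories.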

Source: https://www.freecodecamp.org/news/feature-engineering-and-feature-selection-for-beginners/
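
As a minimal sketch of the techniques described above, the following example applies one-hot, label, count, and target encoding to a toy `color` column using pandas and scikit-learn. The data frame, its column names, and the category order passed to `OrdinalEncoder` are assumptions made for illustration and are not taken from the project data.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data (hypothetical): one categorical feature and a binary target
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green", "red", "red"],
    "target": [1, 0, 0, 1, 1, 0],
})

# One-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)

# Label (ordinal) encoding: one integer per category, here red=0, green=1, blue=2
ordinal = OrdinalEncoder(categories=[["red", "green", "blue"]])
df["color_label"] = ordinal.fit_transform(df[["color"]]).ravel().astype(int)

# Count encoding: how often each category occurs in the data
df["color_count"] = df["color"].map(df["color"].value_counts())

# Target encoding: mean of the target within each category
df["color_target"] = df["color"].map(df.groupby("color")["target"].mean())

print(one_hot)
print(df)
```

Note that the target encoding here is computed on the full toy data set. In a real pipeline the category means would normally be learned on the training folds only, since otherwise the target leaks into the features.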

### Discretization (Marco)

Discretization is a process used to transform continuous data into categorical data. It involves dividing the range of a continuous variable into a set of intervals or bins and then assigning each value to the corresponding bin.

Binning a continuous variable introduces non-linearity and can improve the performance of some models. Bins can also be used to isolate missing values or outliers in a category of their own.

Discretization can help the classifier by reducing noise in the data, which makes patterns easier to identify. The number of features stays the same, but each discretized feature takes far fewer distinct values, which can simplify the model and make it easier to see which features matter most.

The effectiveness of discretization can depend on the model applied, and some models are more sensitive to the choice of discretization method than others. Some machine learning algorithms perform better when they are trained with discrete variables. For example, decision trees and random forests can benefit from discretization because it limits the number of candidate split points per feature. On the other hand, linear regression models may not benefit as much from discretization because they work best with continuous variables.

Source: https://towardsdatascience.com/an-intro-to-discretization-techniques-for-machine-learning-93dce1198e68
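
To illustrate how values are assigned to bins, the sketch below discretizes a small, skewed toy array with `KBinsDiscretizer`. The values are made up for illustration. With `strategy='uniform'` the bins have equal width, while `strategy='quantile'` places roughly the same number of values in each bin.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Skewed toy feature (hypothetical values), shaped as a single column
values = np.array([0, 1, 2, 3, 10, 20, 60, 100], dtype=float).reshape(-1, 1)

def bin_ids(strategy):
    """Return the bin index of every value for the given binning strategy."""
    disc = KBinsDiscretizer(n_bins=4, strategy=strategy, encode="ordinal")
    return disc.fit_transform(values).ravel().astype(int).tolist()

uniform_bins = bin_ids("uniform")    # bin edges at 0, 25, 50, 75, 100
quantile_bins = bin_ids("quantile")  # two values per bin

print("uniform: ", uniform_bins)
print("quantile:", quantile_bins)
```

On skewed data like this, equal-width binning puts most values into the first bin, whereas equal-frequency binning spreads them evenly.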

Below we take a look at equal frequency discretization. This entails transforming continuous data into bins, with each bin having the same (or similar) number of records.

In [7]:
from sklearn.preprocessing import KBinsDiscretizer
    
# trains and predicts on the transformed data
estimator = DecisionTreeClassifier()
bestScore = {}
bestBins = 0
for i in range(2,10):    
    kbins = KBinsDiscretizer(n_bins=i, strategy='quantile', encode='ordinal')
    testpip = CustomPipeline([     
        ('discretizer', CustomColumnTransformer([
            ('bins', kbins, make_column_selector(dtype_include=['float64']))
        ], remainder='passthrough')),
        ('estimator', estimator)],
        skip_evaluation=True,
        skip_storing=True)
    testpip.run()
    testpip.evaluate(testpip.pipeline, testpip.X_train, testpip.y_train, False)
    
    if len(bestScore) <= 0:
        bestScore = testpip.evaluation_scoring
        bestBins = i
        
    elif bestScore['test_mcc'].mean() < testpip.evaluation_scoring['test_mcc'].mean():
        bestScore = testpip.evaluation_scoring
        bestBins = i

print(f'The best score was achieved with bins of {bestBins}:')
for score in bestScore:
    print('    ' + score + ':', bestScore[score].mean())

loading data
preparing data
running pipeline
loading data
preparing data
running pipeline
loading data
preparing data
running pipeline
loading data
preparing data
running pipeline
loading data
preparing data
running pipeline
loading data
preparing data
running pipeline
loading data
preparing data
running pipeline
loading data
preparing data
running pipeline
The best score was achieved with bins of 2:
    fit_time: 0.17308621406555175
    score_time: 0.021747493743896486
    test_accuracy: 0.6976863404460938
    test_f1-score: 0.6012375853468177
    test_mcc: 0.41910157810277127


Compared to the baseline:<br>
The test_accuracy improved from 0.669 to 0.698.<br>
The test_f1-score improved from 0.564 to 0.601.<br>
The test_mcc improved from 0.365 to 0.419.

Below we take a look at equal-width discretization. As the name suggests, it splits the value range of a feature into bins of equal width.

In [8]:
# trains and predicts on the transformed data
estimator = DecisionTreeClassifier()
bestScore = {}
bestBins = 0
for i in range(2,10):    
    kbins = KBinsDiscretizer(n_bins=i, strategy='uniform', encode='ordinal')
    testpip = CustomPipeline([     
        ('discretizer', CustomColumnTransformer([
            ('bins', kbins, make_column_selector(dtype_include=['float64']))
        ], remainder='passthrough')),
        ('estimator', estimator)],
        skip_evaluation=True,
        skip_storing=True)
    testpip.run()
    testpip.evaluate(testpip.pipeline, testpip.X_train, testpip.y_train, False)
    
    if len(bestScore) <= 0:
        bestScore = testpip.evaluation_scoring
        bestBins = i
        
    elif bestScore['test_mcc'].mean() < testpip.evaluation_scoring['test_mcc'].mean():
        bestScore = testpip.evaluation_scoring
        bestBins = i

print(f'The best score was achieved with bins of {bestBins}:')
for score in bestScore:
    print('    ' + score + ':', bestScore[score].mean())

loading data
preparing data
running pipeline
loading data
preparing data
running pipeline
loading data
preparing data
running pipeline
loading data
preparing data
running pipeline
loading data
preparing data
running pipeline
loading data
preparing data
running pipeline
loading data
preparing data
running pipeline
loading data
preparing data
running pipeline
The best score was achieved with bins of 2:
    fit_time: 0.16273894309997558
    score_time: 0.021566247940063475
    test_accuracy: 0.7075452266655814
    test_f1-score: 0.611777011823585
    test_mcc: 0.4368278155365372


Compared to the baseline:<br>
The test_accuracy improved from 0.669 to 0.708.<br>
The test_f1-score improved from 0.564 to 0.612.<br>
The test_mcc improved from 0.365 to 0.437.

Below we take a look at k-means discretization. It runs the k-means clustering algorithm on each feature separately and assigns every value to the bin of its nearest cluster center.

In [9]:
# trains and predicts on the transformed data
estimator = DecisionTreeClassifier()
bestScore = {}
bestBins = 0
for i in range(2,10):    
    kbins = KBinsDiscretizer(n_bins=i, strategy='kmeans', encode='ordinal')
    testpip = CustomPipeline([     
        ('discretizer', CustomColumnTransformer([
            ('bins', kbins, make_column_selector(dtype_include=['float64']))
        ], remainder='passthrough')),
        ('estimator', estimator)],
        skip_evaluation=True,
        skip_storing=True)
    testpip.run()
    testpip.evaluate(testpip.pipeline, testpip.X_train, testpip.y_train, False)
    
    if len(bestScore) <= 0:
        bestScore = testpip.evaluation_scoring
        bestBins = i
        
    elif bestScore['test_mcc'].mean() < testpip.evaluation_scoring['test_mcc'].mean():
        bestScore = testpip.evaluation_scoring
        bestBins = i

print(f'The best score was achieved with bins of {bestBins}:')
for score in bestScore:
    print('    ' + score + ':', bestScore[score].mean())

loading data
preparing data
running pipeline
loading data
preparing data
running pipeline
loading data
preparing data
running pipeline
loading data
preparing data
running pipeline
loading data
preparing data
running pipeline
loading data
preparing data
running pipeline
loading data
preparing data
running pipeline
loading data
preparing data
running pipeline
The best score was achieved with bins of 2:
    fit_time: 0.23662548065185546
    score_time: 0.026444244384765624
    test_accuracy: 0.6997054714674864
    test_f1-score: 0.607576314494165
    test_mcc: 0.4224944366934965


Compared to the baseline:<br>
The test_accuracy improved from 0.669 to 0.700.<br>
The test_f1-score improved from 0.564 to 0.608.<br>
The test_mcc improved from 0.365 to 0.422.

### Normalization (Standardization) (Patrick)

Normalization (standardization) is a type of feature scaling that brings the values of different features onto a comparable scale, for example a fixed range such as [0, 1] or a distribution with zero mean and unit variance. Some transformations, such as log scaling or clipping, also reduce skewness or the influence of outliers, which can otherwise hurt the performance or accuracy of predictive models. Putting features on a common scale keeps features with large numeric ranges from dominating models that are sensitive to feature magnitude.

Four common normalization techniques are scaling to a range, clipping, log scaling, and z-score.

Source: https://developers.google.com/machine-learning/data-prep/transform/normalization 
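
The four techniques named above can be sketched with NumPy as follows. The input values and the clipping bounds are hypothetical and chosen only to make the effect of each transformation visible.

```python
import numpy as np

# Toy feature with one large outlier (hypothetical values)
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Scaling to a range (min-max): maps the values onto [0, 1]
min_max = (x - x.min()) / (x.max() - x.min())

# Clipping: caps values outside the chosen bounds (bounds are an assumption)
clipped = np.clip(x, 1.0, 10.0)

# Log scaling: compresses large values; log1p computes log(1 + x)
log_scaled = np.log1p(x)

# Z-score: subtract the mean and divide by the standard deviation
z_score = (x - x.mean()) / x.std()

for name, arr in [("min-max", min_max), ("clipped", clipped),
                  ("log", log_scaled), ("z-score", z_score)]:
    print(f"{name:8s}", np.round(arr, 3))
```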

## Dimensionality - Processes

### Feature Selection (Patrick)
Feature selection is the process of selecting a subset of relevant features from the dataset that can help improve the accuracy, performance, or interpretability of your predictive models. By reducing the number of features, it can reduce the complexity of the model, avoid overfitting, and speed up training and inference. Having irrelevant features in the data can actually decrease the accuracy of the machine learning models.

The top reasons to use feature selection are:
* It enables the machine learning algorithm to train faster.
* It reduces the complexity of a model and makes it easier to interpret.
* It improves the accuracy of a model if the right subset is chosen.
* It reduces overfitting.

Source: https://www.freecodecamp.org/news/feature-engineering-and-feature-selection-for-beginners/
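
One simple way to select features is a univariate filter. The sketch below keeps the three highest-scoring features with scikit-learn's `SelectKBest` and the ANOVA F-test (`f_classif`). The synthetic data set and the choice of `k=3` are assumptions for illustration; this is not necessarily the selection method used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 10 features, of which only 3 carry information about y
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3,
    n_redundant=0, random_state=0,
)

# Score each feature against the target and keep the k best ones
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X, y)
selected_idx = selector.get_support(indices=True)

print("shape before:", X.shape)
print("shape after: ", X_selected.shape)
print("kept feature indices:", selected_idx.tolist())
```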

### Dimensionality Reduction
Dimensionality reduction reduces the number of features in a dataset while preserving the most important information or patterns. This can improve the performance, accuracy, or interpretability of predictive models, especially with high-dimensional or noisy data. Common techniques include Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), t-SNE, and autoencoders.
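
As a minimal sketch of PCA, the example below builds six correlated features from two hidden factors and projects them back onto two principal components. The synthetic data is an assumption made for illustration and has no relation to the project data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Six observed features that are noisy linear mixtures of two hidden factors
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 6))

# Project the six features onto the two directions of largest variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
explained = pca.explained_variance_ratio_.sum()

print("shape before:", X.shape)
print("shape after: ", X_reduced.shape)
print(f"variance explained by 2 components: {explained:.4f}")
```

Because the six features are driven by only two factors, two components retain almost all of the variance. On real data the explained variance ratio helps decide how many components to keep.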

### Feature Combination (Marco)
Feature combination creates new features by combining existing features in the dataset, for example by multiplying, adding, or subtracting two of them. This can capture more complex relationships or interactions between features and improve the performance or accuracy of predictive models.

Source: https://towardsdatascience.com/feature-engineering-combination-polynomial-features-3caa4c77a755
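
The following pandas sketch shows two kinds of combination: arithmetic on numeric features and concatenation of two categorical ids into a single key. All column names and values are hypothetical, and the code does not reproduce the `CombineFeatureTransformer` used in this notebook.

```python
import pandas as pd

# Toy data (hypothetical columns)
df = pd.DataFrame({
    "width": [2.0, 3.0, 4.0],
    "height": [5.0, 1.0, 2.5],
    "region": [1, 1, 2],
    "district": [10, 11, 10],
})

# Numeric interaction features: product and sum of two existing columns
df["area"] = df["width"] * df["height"]
df["width_plus_height"] = df["width"] + df["height"]

# Categorical combination: join two ids into one combined category
df["region_district"] = df["region"].astype(str) + "_" + df["district"].astype(str)

print(df)
```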

In [2]:
# combine geo_level_1_id and geo_level_2_id
estimator = DecisionTreeClassifier()
combine = CombineFeatureTransformer('geo_level_1_id', 'geo_level_2_id')
remove = RemoveFeatureTransformer(['geo_level_1_id', 'geo_level_2_id', 'geo_level_3_id'])

testpip = CustomPipeline([     
        ('combine_geoLevels', combine),
        ('remove_singleGeoLevels', remove),
        ('estimator', estimator)
        ],
        skip_evaluation=False,
        skip_storing=True)
testpip.run()

loading data
preparing data
running pipeline
evaluating pipeline
    fit_time: 0.3593448162078857
    score_time: 0.046203088760375974
    test_accuracy: 0.6401884897249006
    test_f1-score: 0.5297284170485315
    test_mcc: 0.3111103133610721


## Recombine - Process
In our data there are two groups of features that are (almost) one-hot encoded, namely 'has_superstructure_X' and 'has_secondary_use_X'. It could be useful to reconstruct the original categorical features from them.
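
A possible way to recombine such columns is sketched below. Because the columns are only almost one-hot encoded, a row can have several flags set or none at all, so the sketch joins the names of all active flags and uses `"none"` when no flag is set. The suffixes `a`, `b`, and `c` and the toy rows are hypothetical placeholders for the real column names.

```python
import pandas as pd

# Toy flags (hypothetical suffixes standing in for the real column names)
df = pd.DataFrame({
    "has_superstructure_a": [1, 0, 1, 0],
    "has_superstructure_b": [0, 1, 1, 0],
    "has_superstructure_c": [0, 0, 0, 0],
})

prefix = "has_superstructure_"
flag_cols = [c for c in df.columns if c.startswith(prefix)]

def recombine(row):
    """Join the suffixes of all flags set to 1 into one category label."""
    active = [c[len(prefix):] for c in flag_cols if row[c] == 1]
    return "+".join(active) if active else "none"

df["superstructure"] = df[flag_cols].apply(recombine, axis=1)
print(df["superstructure"].tolist())
```

Joining multiple active flags keeps all information but can create many rare categories, so the resulting feature may need a further encoding or grouping step.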

## Evaluation of Feature Engineering (Thomas)

* Analyse the relation of the new features to the target value
* Evaluate a simple prediction model with different feature sets
* Analyse the importance of the new features in the model
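
The three checks listed above can be sketched on synthetic data as follows. The features `a` and `b`, the engineered feature `a_times_b`, and the target definition are assumptions chosen so that the engineered feature is clearly useful; the shallow tree depth is also only an illustrative choice.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(size=400), "b": rng.normal(size=400)})
y = (df["a"] * df["b"] > 0).astype(int)   # target depends on the interaction
df["a_times_b"] = df["a"] * df["b"]       # engineered feature

# 1) Relation of the new feature to the target
corr_new = df["a_times_b"].corr(y)

# 2) Simple model evaluated on different feature sets
tree = DecisionTreeClassifier(max_depth=1, random_state=0)
score_base = cross_val_score(tree, df[["a", "b"]], y, cv=5).mean()
score_new = cross_val_score(tree, df, y, cv=5).mean()

# 3) Importance of the new feature in the fitted model
tree.fit(df, y)
importances = dict(zip(df.columns, tree.feature_importances_))

print(f"correlation with target: {corr_new:.3f}")
print(f"accuracy without / with new feature: {score_base:.3f} / {score_new:.3f}")
print("feature importances:", importances)
```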