# Feature Engineering

In the feature engineering step, we create new features or modify existing ones to improve the performance of the machine learning model.

Helpful Links:
https://www.freecodecamp.org/news/feature-engineering-and-feature-selection-for-beginners/

## Basic - Processes

### Encoding

Encoding transforms categorical data into numerical values that machine learning algorithms can process. Some of the most commonly used techniques are:

* One-hot encoding
* Label encoding
* Binary encoding
* Count encoding
* Target encoding
* Hashing encoding

One-hot encoding converts a categorical feature into one binary column per category. For example, a categorical feature called “color” with three categories (red, green, and blue) becomes three binary columns, and each row has a 1 in the column of its own category and a 0 in the others.

Label encoding converts categorical data into numerical data by assigning a unique integer to each category. For example, for a categorical feature called “color” with three categories (red, green, and blue), we can assign the values 0, 1, and 2 to red, green, and blue respectively. Because integers are ordered, some models may treat the categories as ranked even when they have no natural order.
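
The two techniques above can be sketched with pandas as follows. This is a minimal illustration on made-up data, not code from this project; the toy DataFrame and the explicit category order are assumptions chosen to match the "color" example.

```python
import pandas as pd

# Toy data (assumed for illustration): the "color" feature from the text.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encoding: one 0/1 column per category.
# pandas orders the new columns alphabetically by category name.
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)

# Label encoding: one integer per category.
# Listing the categories explicitly fixes the mapping red=0, green=1, blue=2.
categories = ["red", "green", "blue"]
label = pd.Categorical(df["color"], categories=categories).codes

print(one_hot)
print(label)
```

Every row of `one_hot` contains exactly one 1, while `label` is a single integer column.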

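Binary encoding is similar to one-hot encoding but uses fewer features: each category is first assigned an integer, and that integer is then written in binary with one column per bit, so n categories need roughly log2(n) columns instead of n. Count encoding replaces each category with the number of times it appears in the dataset. Target encoding replaces each category with the mean target value for that category. Hashing encoding applies a hash function to map each category to a fixed-length vector, so the number of output columns does not grow with the number of categories, but different categories can collide.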

Source: https://www.freecodecamp.org/news/feature-engineering-and-feature-selection-for-beginners/
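
The remaining four techniques could look roughly like this in pandas and NumPy. This is a sketch on made-up data; the column names, the bucket count `n_buckets`, and the helper `hash_bucket` are hypothetical, and dedicated encoder libraries may differ in details such as bit order.

```python
import hashlib

import numpy as np
import pandas as pd

# Toy data (assumed for illustration).
df = pd.DataFrame({
    "color":  ["red", "green", "blue", "green", "red", "green"],
    "target": [1, 0, 1, 1, 0, 0],
})

# Binary encoding: integer code per category, written as bits
# (least significant bit first). Codes here: blue=0, green=1, red=2.
codes = df["color"].astype("category").cat.codes.to_numpy().astype(int)
n_bits = max(1, int(codes.max()).bit_length())
binary = (codes[:, None] >> np.arange(n_bits)) & 1

# Count encoding: replace each category with its number of occurrences.
df["color_count"] = df["color"].map(df["color"].value_counts())

# Target encoding: replace each category with its mean target value.
# In practice the means should be computed on training data only,
# otherwise the target leaks into the feature.
df["color_target"] = df["color"].map(df.groupby("color")["target"].mean())

# Hashing encoding: map each category to one of n_buckets via a hash.
n_buckets = 4  # hypothetical choice

def hash_bucket(value: str) -> int:
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

df["color_hash"] = df["color"].map(hash_bucket)
print(df)
```

Three categories fit into two binary columns here, compared with three columns for one-hot encoding.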

### Discretization

Discretization is a process used to transform continuous data into categorical data. It involves dividing the range of a continuous variable into a set of intervals or bins and then assigning each value to the corresponding bin.

Binning a continuous variable introduces non-linearity and can improve the performance of the model. It can also be used to identify missing values or outliers.

Discretization can help a classifier by reducing noise in the data, since small fluctuations within a bin are ignored, which can make patterns easier to identify. The resulting categorical variables have fewer distinct values and can be easier to work with and to interpret.

The effectiveness of discretization depends on the model applied, and some models are more sensitive to the choice of discretization method than others. Some machine learning algorithms perform better when they are trained with discrete variables. For example, decision trees and random forests can benefit from discretization, although they already split continuous variables at learned thresholds. On the other hand, linear regression models may not benefit as much from discretization because they work best with continuous variables.

Source: https://towardsdatascience.com/an-intro-to-discretization-techniques-for-machine-learning-93dce1198e68
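
Three common ways to choose the bins can be sketched with pandas. The data, the number of bins, and the bin labels below are assumptions made for illustration.

```python
import pandas as pd

# Toy data (assumed for illustration).
ages = pd.Series([3, 17, 25, 42, 58, 71], name="age")

# Equal-width binning: 3 intervals of equal length over the value range.
equal_width = pd.cut(ages, bins=3, labels=["low", "mid", "high"])

# Equal-frequency binning: 3 intervals holding about the same number of values.
equal_freq = pd.qcut(ages, q=3, labels=["low", "mid", "high"])

# Custom edges chosen from domain knowledge; intervals include their right edge.
custom = pd.cut(ages, bins=[0, 18, 65, 120], labels=["minor", "adult", "senior"])

print(pd.DataFrame({"age": ages, "equal_width": equal_width,
                    "equal_freq": equal_freq, "custom": custom}))
```

Equal-width bins can end up nearly empty when the data is skewed, whereas equal-frequency bins adapt their edges to the data.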

### Normalization (Standardization)

Normalization (standardization) is a type of feature scaling that transforms the values of your features to a common scale or a standard distribution, such as a normal (Gaussian) or a uniform distribution. Depending on the technique, this can reduce skewness, the influence of outliers, or heteroscedasticity in your data, all of which can affect the performance or accuracy of your predictive models. Normalizing the data keeps features with large numeric ranges from dominating the model, so that no feature is favored only because of its scale.

Four common normalization techniques are scaling to a range, clipping, log scaling, and z-score.

Source: https://developers.google.com/machine-learning/data-prep/transform/normalization 
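
The four techniques can be written in a few lines of NumPy. This is a minimal sketch on made-up values; the clipping bounds of 0 and 10 are arbitrary assumptions.

```python
import numpy as np

# Toy feature with one large outlier (assumed for illustration).
x = np.array([1.0, 2.0, 4.0, 8.0, 100.0])

# Scaling to a range (min-max): maps values linearly onto [0, 1].
min_max = (x - x.min()) / (x.max() - x.min())

# Clipping: caps values at chosen bounds (here 0 and 10, an arbitrary choice).
clipped = np.clip(x, 0.0, 10.0)

# Log scaling: compresses a long tail; requires strictly positive values.
log_scaled = np.log(x)

# Z-score: subtract the mean and divide by the standard deviation.
z_score = (x - x.mean()) / x.std()

print(min_max, clipped, log_scaled, z_score, sep="\n")
```

Any statistics used for scaling (minimum, maximum, mean, standard deviation) should be computed on the training data and then reused for validation and test data.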

## Dimensionality - Processes

### Feature Selection
Feature selection is the process of selecting a subset of relevant features from the dataset that can help improve the accuracy, performance, or interpretability of your predictive models. By reducing the number of features, it can reduce the complexity of the model, avoid overfitting, and speed up training and inference. Having irrelevant features in the data can actually decrease the accuracy of the machine learning models.

The top reasons to use feature selection are:
* It enables the machine learning algorithm to train faster.
* It reduces the complexity of a model and makes it easier to interpret.
* It improves the accuracy of a model if the right subset is chosen.
* It reduces overfitting.

Source: https://www.freecodecamp.org/news/feature-engineering-and-feature-selection-for-beginners/
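
One simple form of feature selection is a univariate filter. The sketch below uses scikit-learn's `SelectKBest` with the ANOVA F-test on synthetic data; the choice of this particular method, the data, and `k=2` are assumptions made for illustration, not the method used in this project.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data (assumed): 2 informative features and 3 pure-noise features.
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)
informative = y[:, None] * 2.0 + rng.normal(size=(n, 2))  # shifted by class
noise = rng.normal(size=(n, 3))
X = np.hstack([informative, noise])

# Keep the k features with the highest F-score against the target.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
selected_idx = selector.get_support(indices=True)

print("selected feature indices:", selected_idx)
print("shape before/after:", X.shape, X_selected.shape)
```

Filters like this score each feature on its own, so they can miss features that are only useful in combination with others.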

### Dimensionality Reduction
Dimensionality reduction reduces the number of features in your dataset while preserving the most important information or patterns. This can help improve the performance, accuracy, or interpretability of your predictive models, especially when dealing with high-dimensional or noisy data. Some common techniques for dimensionality reduction include Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), t-SNE, and autoencoders.
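
As a minimal sketch of one of these techniques, PCA with scikit-learn can be applied as follows. The synthetic data (five observed features driven by two hidden factors) and `n_components=2` are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (assumed): 5 features that are noisy mixtures of 2 hidden factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.01 * rng.normal(size=(300, 5))

# Project the 5 features onto the 2 directions of highest variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Share of the total variance that the 2 components retain.
explained = pca.explained_variance_ratio_.sum()
print(X.shape, "->", X_reduced.shape, "explained variance:", round(explained, 4))
```

PCA is sensitive to feature scale, so features are usually standardized before it is applied.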

### Feature Combination
Feature combination creates new features by combining existing features in your dataset, for example by multiplying, adding, or subtracting two of them. This can help capture more complex relationships or interactions between features and improve the performance or accuracy of predictive models.

Source: https://towardsdatascience.com/feature-engineering-combination-polynomial-features-3caa4c77a755
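
In pandas, such combinations are simple column arithmetic. The column names below (`width`, `height`, and the derived ones) are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Toy data with hypothetical column names.
df = pd.DataFrame({"width": [2.0, 3.0, 4.0], "height": [5.0, 1.0, 2.0]})

# Product of two features (an interaction term).
df["area"] = df["width"] * df["height"]

# Sum and difference of two features.
df["width_plus_height"] = df["width"] + df["height"]
df["width_minus_height"] = df["width"] - df["height"]

print(df)
```

A linear model cannot represent the product `width * height` from the two original columns alone, which is why an explicit interaction feature can help it.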

## Recombine - Process
In our data there are two groups of features that are (almost) one-hot encoded, namely 'has_superstructure_X' and 'has_secondary_use_X'. It could be useful to reconstruct the original categorical features from them.
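
One possible way to do this is sketched below. Only the `has_superstructure_` prefix comes from our data; the suffixes `a`, `b`, `c`, the toy rows, and the helper `recombine` are hypothetical. Because the columns are only almost one-hot encoded, the sketch assumes a row may have several flags set or none at all, and it joins all active flags into one label.

```python
import pandas as pd

# Toy data: hypothetical suffixes, only the prefix is taken from the real data.
df = pd.DataFrame({
    "has_superstructure_a": [1, 0, 1, 0],
    "has_superstructure_b": [0, 1, 1, 0],
    "has_superstructure_c": [0, 0, 0, 0],
})

prefix = "has_superstructure_"
flag_cols = [c for c in df.columns if c.startswith(prefix)]

def recombine(row: pd.Series) -> str:
    # Collect the suffixes of all flags set to 1 in this row.
    active = [c[len(prefix):] for c in flag_cols if row[c] == 1]
    return "+".join(active) if active else "none"

df["superstructure"] = df[flag_cols].apply(recombine, axis=1)
print(df["superstructure"].tolist())
```

If many different flag combinations occur, the recombined feature has many rare categories, so it may be worth grouping infrequent combinations.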

## Evaluation of Feature Engineering

* Analyse the relation of the new features to the target value
* Evaluate a simple prediction model with different feature sets
* Analyse the importance of the new features in the model
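
The three checks above can be sketched with scikit-learn on synthetic data. Everything in this example is an assumption made for illustration: the data, the engineered product feature, logistic regression as the simple model, and absolute coefficients as the importance measure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data (assumed): the target depends on the product of two features.
rng = np.random.default_rng(0)
X_base = rng.normal(size=(400, 2))
y = (X_base[:, 0] * X_base[:, 1] > 0).astype(int)

# Engineered feature set: the base features plus their product.
product = X_base[:, 0] * X_base[:, 1]
X_engineered = np.column_stack([X_base, product])

# 1) Relation of the new feature to the target (Pearson correlation).
corr = np.corrcoef(product, y)[0, 1]

# 2) Simple model evaluated on both feature sets with 5-fold cross-validation.
model = LogisticRegression(max_iter=1000)
score_base = cross_val_score(model, X_base, y, cv=5).mean()
score_engineered = cross_val_score(model, X_engineered, y, cv=5).mean()

# 3) Importance of each feature, here the absolute model coefficient.
importance = np.abs(model.fit(X_engineered, y).coef_[0])

print("correlation with target:", round(corr, 3))
print("accuracy base / engineered:", round(score_base, 3), round(score_engineered, 3))
print("importance per feature:", importance.round(3))
```

Comparing both feature sets with the same model and the same cross-validation splits keeps the comparison fair; coefficients are only comparable as importances when the features are on similar scales.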