# Introduction to Feature Engineering

* Author: Oliver Kretzschmar
* Last Update: 2019-05-01

### Remember: Our Starting Point is Tidy Data

A dataset in our context is "tidy" if it satisfies the following conditions:

* the observations are in the rows
* the variables or features are in the columns
* all of this is contained in a single dataset.

<html><img src="../img/TidyData1.jpg" width=800></html>
<font size="1">(Image source: [WH17])</font>

Tidy data is one of the prerequisites for good machine learning results. To get tidy data we have to do a lot of things, e.g. data cleaning, feature engineering, selection, extraction and so on. The process of transforming and mapping the raw data into a tidy data format is also known as *data wrangling*.
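
To make the idea of data wrangling concrete, here is a minimal sketch of one typical step: reshaping a "wide" table into tidy form. The table, the column names and the helper `to_tidy` are hypothetical and only serve as an illustration. The sketch uses the standard library only; in practice a library function such as `pandas.melt` does the same job.

```python
# Hypothetical "wide" table: one row per city, one column per year.
# The year is a variable, but here it is hidden in the column names.
wide = [
    {"city": "Stuttgart", "2017": 10.1, "2018": 11.3},
    {"city": "Hamburg", "2017": 9.4, "2018": 10.6},
]


def to_tidy(rows, id_key, var_name, value_name):
    """Turn every (row, non-id column) pair into one observation (one row)."""
    tidy_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_key:
                continue  # the identifier is repeated on every tidy row
            tidy_rows.append(
                {id_key: row[id_key], var_name: column, value_name: value}
            )
    return tidy_rows


tidy = to_tidy(wide, id_key="city", var_name="year", value_name="temperature")
for record in tidy:
    print(record)
```

After the reshaping, every row is one observation (a city in a given year) and every column is one variable (`city`, `year`, `temperature`).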


### What is Feature Engineering?

*Features* are the input variables of a dataset. As we saw in an earlier section, preparing the data is a very important and usually time-consuming part of a machine learning process.

`What are we doing in Feature Engineering?` There is a lot to do, e.g. exploring the data through visualization and other methods, handling missing values, and cleaning and transforming features into formats that are suitable for our machine learning model.
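
Two of these steps can be sketched in a few lines. The records below are made up, and mean imputation and one-hot encoding are just one possible choice for each step, not the only way to do it.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing value.
records = [
    {"age": 34, "color": "red"},
    {"age": None, "color": "blue"},
    {"age": 46, "color": "red"},
]

# Step 1: handle missing values.
# Here we replace a missing age with the mean of the observed ages.
observed_ages = [r["age"] for r in records if r["age"] is not None]
age_fill = mean(observed_ages)

# Step 2: transform a categorical feature into a numeric format.
# One-hot encoding creates one 0/1 indicator column per category.
categories = sorted({r["color"] for r in records})

features = []
for r in records:
    age = r["age"] if r["age"] is not None else age_fill
    one_hot = [1 if r["color"] == c else 0 for c in categories]
    features.append([age] + one_hot)

print(["age"] + [f"color_{c}" for c in categories])
for row in features:
    print(row)
```

The result is a purely numeric feature matrix without gaps, which is the kind of input most machine learning models expect.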


`What are we doing in Feature Extraction and Selection?` Many machine learning models are very sensitive to wrong features (we discussed this in the context of *domain knowledge*). Quite apart from computing time and other disadvantages, we therefore have to find the *right* features for our use case and remove the others (`Discussion:` What does *right* mean here?).
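
What *right* means depends on the use case, but one very simple criterion can be sketched directly: a feature that has (almost) the same value for every observation cannot help a model to tell observations apart. The following sketch removes such features with a variance filter. The data, the feature names and the helper `select_by_variance` are hypothetical, and this is only one of many possible selection criteria.

```python
from statistics import pvariance

# Hypothetical feature matrix: one row per observation, one column per feature.
feature_names = ["height", "constant_flag", "weight"]
X = [
    [1.70, 1, 65.0],
    [1.82, 1, 80.5],
    [1.65, 1, 58.2],
]


def select_by_variance(rows, names, threshold=0.0):
    """Keep only the features whose population variance exceeds the threshold."""
    columns = list(zip(*rows))  # transpose: one tuple per feature
    kept = [i for i, col in enumerate(columns) if pvariance(col) > threshold]
    kept_names = [names[i] for i in kept]
    kept_rows = [[row[i] for i in kept] for row in rows]
    return kept_names, kept_rows


kept_names, X_selected = select_by_variance(X, feature_names)
print(kept_names)
print(X_selected)
```

Note that this filter never looks at the target variable. Criteria that do take the target into account usually give a better answer to the question of which features are *right*.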

After a short introduction to data, we will check out different methods and techniques of feature engineering, extraction and selection as described in [ZC18], [KJK19], [REF19] and others.


### How Much Data Do We Need?

We cannot answer this question exactly, but we know some influencing factors:

* Complexity of model
* Variance of value distribution
* Number of features and learning parameters
* Strength of the relationship between features and target variable
* Expected uniqueness of (inferential) statistical conclusions

<font size="2">(Source: Urban, Dieter ; Mayerl, Jochen: Angewandte Regressionsanalyse: Theorie, Technik und Praxis. 5. Springer VS, 2018)</font>


A common rule of thumb says that the amount of data needed grows exponentially with the number of features, especially in Deep Learning. For that reason, in most situations the data has to be considered carefully, as we have seen before: Data Cleaning, Feature Engineering, Selection, Extraction and so on (see also [CLO09]).
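
The reasoning behind this rule of thumb can be illustrated with a small calculation. Assume, purely for illustration, that every feature is split into 10 intervals and that we would like at least one sample in every combination of intervals. The function name `cells_to_cover` and the choice of 10 intervals are assumptions made for this sketch.

```python
def cells_to_cover(n_features, bins_per_feature=10):
    """Number of grid cells when every feature is split into equally many bins."""
    return bins_per_feature ** n_features


# The number of cells, and with it the number of samples needed to put at
# least one sample into each cell, grows exponentially with the feature count.
for n_features in (1, 2, 3, 5, 10):
    print(n_features, "features ->", cells_to_cover(n_features), "cells")
```

Every additional feature multiplies the number of cells by the number of bins, which is why removing unnecessary features reduces the data requirement so strongly.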
    
What can we do if we have too little data?

- Augmentation
- Better Data Evaluation and Feature Engineering
- Active Learning - careful selection of training samples
- Apply pretrained models
- Resampling methods
<BR>
    
<font size="2">(Source: Maucher, J.: Introduction to Artificial Intelligence, Machine Learning and Deep Learning, IHK Workshop, 2018)</font>
<BR>
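
Of the options listed above, resampling is the easiest one to sketch in a few lines. The following example uses the bootstrap, i.e. drawing samples with replacement from the data we already have, to estimate how much a statistic (here the mean) varies. The data values, the seed and the helper `bootstrap_sample` are hypothetical, and the bootstrap is only one of several resampling methods.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) values from data with replacement."""
    return [rng.choice(data) for _ in data]


rng = random.Random(42)  # fixed seed so that the run is reproducible
data = [2.1, 3.4, 1.9, 5.0, 4.2]  # hypothetical small dataset

# Recompute the mean on many bootstrap samples.
means = []
for _ in range(1000):
    sample = bootstrap_sample(data, rng)
    means.append(sum(sample) / len(sample))

# The spread of the bootstrap means indicates how uncertain the estimate is.
means.sort()
print("mean of original data:", sum(data) / len(data))
print("middle 95% of bootstrap means:", means[25], "to", means[974])
```

Resampling does not create new information. It only helps us to judge how reliable a result obtained from a small dataset is.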

### Influence of Data Scale Levels on Models

Different scale levels of the variables:

<html><img src="../img/ScaleLevels.jpg"></html>
<font size="2">(Image source: [BEP10])</font>


The scale level of the variables (input/output) influences the set of applicable machine learning algorithms. For example:

<html><img src="../img/MLAlgo4ScaleLevels.jpg"></html>
<font size="2">(Image source: [BEP10])</font>
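
The same principle already applies to very simple operations. As a minimal sketch, independent of the figures above, the following function chooses a measure of central tendency that is meaningful for a given scale level: the mode for nominal data, the median for ordinal data and the mean for interval or ratio data. The function `summarize` and the example values are hypothetical.

```python
from statistics import mean, median_low, mode


def summarize(values, scale, order=None):
    """Return a central value that is meaningful for the given scale level."""
    if scale == "nominal":
        # Categories without an order: only counting is meaningful.
        return mode(values)
    if scale == "ordinal":
        # Ordered categories: rank them and take the (lower) median rank.
        ranks = [order.index(v) for v in values]
        return order[median_low(ranks)]
    if scale in ("interval", "ratio"):
        # Numeric values with meaningful differences: the mean is allowed.
        return mean(values)
    raise ValueError(f"unknown scale level: {scale}")


print(summarize(["red", "blue", "red"], "nominal"))
print(summarize(["low", "high", "medium", "low"], "ordinal",
                order=["low", "medium", "high"]))
print(summarize([1.0, 2.0, 3.0], "ratio"))
```

An algorithm that computes means or distances internally makes the same assumption about its inputs, which is why nominal or ordinal features usually have to be encoded before such an algorithm can be applied.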