Lecture: AI I - Basics 

Previous:
[**Chapter 3.6: Additional Libraries and Tools**](../03_data/06_additionals.ipynb)

---

# Chapter 4.1: Data Preparation with scikit-learn

- [Imputation](#imputation)
- [Scaling](#scaling)
- [Dimensionality Reduction](#dimensionality-reduction)
- [Pipelines](#pipelines)
- [Feature Union](#feature-union)
- [Column Transformations](#column-transformations)

__Scikit-learn__ (also known as __sklearn__) is an open-source software library for machine learning in Python. It is very popular and actively maintained. The library offers various classification, regression, and clustering algorithms. In addition, sklearn also includes algorithms for model selection, dimensionality reduction, and data preprocessing.  

In this notebook, we (again) focus on data preprocessing (Data Preparation) and cover the following topics:
- Imputation  
- Scaling  
- Dimensionality Reduction  
- Pipelines  
- Feature Union  
- Column Transformations  

The documentation for scikit-learn can be found [here](https://scikit-learn.org/stable/index.html).  


In [1]:
%matplotlib inline

In [2]:
import numpy as np
import pandas as pd
from sklearn import datasets
import matplotlib.pyplot as plt
import seaborn as sns

## Imputation

Imputation means filling in missing values (NaNs). As in pandas, sklearn offers several methods for replacing missing values. More information can be found [here](https://scikit-learn.org/stable/modules/impute.html).  


In [3]:
nan_data = np.array([[1, 2], [np.nan, 3], [7, 6]])

### One-dimensional Imputation

In one-dimensional imputation, the values are replaced column by column. The class [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer) provides basic strategies for this purpose. Missing values can be replaced with a given constant value or with statistical values (mean, median, or most frequent value) of each column containing the missing values. This class also allows for different encodings of missing values.  


In [4]:
from sklearn.impute import SimpleImputer

# Define two imputers: column mean (the default strategy) and a constant 0
mean_imputer = SimpleImputer()
zero_imputer = SimpleImputer(strategy='constant', fill_value=0)

In [5]:
# The imputer must be fitted before it can transform data:
# either call fit and then transform, or call fit_transform to do both in one step
mean_imputer.fit(nan_data)
mean_imputer.transform(nan_data)

array([[1., 2.],
       [4., 3.],
       [7., 6.]])

In [6]:
zero_imputer.fit_transform(nan_data)

array([[1., 2.],
       [0., 3.],
       [7., 6.]])

In [7]:
different_nan_data = np.array([[np.nan, 5], [8, 2], [6, 6]])
mean_imputer.transform(different_nan_data)

array([[4., 5.],
       [8., 2.],
       [6., 6.]])

__Brainstorming:__  
<details>
<summary>Why was np.nan replaced with 4?</summary>
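Because the mean_imputer was previously fitted ("trained") on <code>nan_data</code>, whose first column has the mean (1 + 7) / 2 = 4. <code>transform</code> reuses this stored value instead of recomputing it from the new data.  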
</details>

<details>
<summary>When can this behavior be an advantage?</summary>
The fill values learned during training are applied unchanged at test time or in live operation, so all data is preprocessed consistently and no information from the test data leaks into the preprocessing.  
</details>


In [8]:
zero_imputer.transform(different_nan_data)

array([[0., 5.],
       [8., 2.],
       [6., 6.]])
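
The behavior discussed above can be made explicit by inspecting what the imputer stored during fitting. The following is a minimal sketch: the names `train` and `test` are chosen for illustration and simply reuse the values of `nan_data` and `different_nan_data`. It relies on the `statistics_` attribute, in which a fitted `SimpleImputer` keeps the fill value of each column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative names; the values are those of nan_data and different_nan_data
train = np.array([[1, 2], [np.nan, 3], [7, 6]])
test = np.array([[np.nan, 5], [8, 2], [6, 6]])

# fit() only looks at the training data and stores one fill value per column
imputer = SimpleImputer(strategy="mean").fit(train)
learned_means = imputer.statistics_

# transform() reuses the stored values, even though the mean of the
# first test column (8 and 6) would be 7
imputed_test = imputer.transform(test)

print(learned_means)
print(imputed_test)
```

The first entry of `learned_means` is 4, which is exactly the value that ends up in the first row of `imputed_test`.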

It is also possible to replace values other than `np.nan`.  


In [9]:
fischers_fritz = [
    ['Fischers', '', 'fischt', 'frische', 'Fische'],
    ['Frische', 'Fische', 'fischt', 'Fischers', '']
]

string_imputer = SimpleImputer(missing_values='', strategy='constant', fill_value='Fritz')
string_imputer.fit_transform(fischers_fritz)          

array([['Fischers', 'Fritz', 'fischt', 'frische', 'Fische'],
       ['Frische', 'Fische', 'fischt', 'Fischers', 'Fritz']], dtype=object)
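
Numeric placeholders work the same way. As a small sketch (the array `sentinel_data` and the choice of `-1` as the missing-value marker are assumptions made for this example), the following replaces every `-1` with the column median, which is less sensitive to the extreme value 100 than the mean would be.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data in which -1 marks a missing measurement
sentinel_data = np.array([
    [1, 10],
    [-1, 20],
    [3, -1],
    [100, 30],
])

# missing_values tells the imputer which value to treat as "missing";
# strategy="median" fills each gap with the median of the remaining column values
median_imputer = SimpleImputer(missing_values=-1, strategy="median")
filled = median_imputer.fit_transform(sentinel_data)

print(filled)
```

The remaining values of the first column are 1, 3, and 100, so the gap is filled with 3; the mean strategy would have inserted a value close to 35.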

### Multidimensional Variant
The multidimensional variant is significantly more complex. Roughly summarized, each missing value is modeled as a function of other features, and this estimate is then used for imputation. This process is repeated several times before the final replacements are made. This behavior is implemented by the [IterativeImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html#sklearn.impute.IterativeImputer).  

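> __Note:__ This class is still experimental, which is why `enable_iterative_imputer` has to be imported from `sklearn.experimental` before `IterativeImputer` itself can be imported.  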


In [10]:
from sklearn.experimental import enable_iterative_imputer  

from sklearn.impute import IterativeImputer

First, we create a small dummy dataset in which the two columns have a clear relationship to each other: the second column is the square of the first.  


In [11]:
x = np.arange(1, 11, dtype="float")
y = x * x 

data = np.array([x, y])
data[(0, 1)] = np.nan
data[(0, 6)] = np.nan
data[(1, 3)] = np.nan
data[(1, 9)] = np.nan
data = data.T
data

array([[ 1.,  1.],
       [nan,  4.],
       [ 3.,  9.],
       [ 4., nan],
       [ 5., 25.],
       [ 6., 36.],
       [nan, 49.],
       [ 8., 64.],
       [ 9., 81.],
       [10., nan]])

Afterwards, we can again use `fit_transform` to fill in the missing values.  


In [12]:
IterativeImputer().fit_transform(data)

array([[ 1.        ,  1.        ],
       [ 2.3175117 ,  4.        ],
       [ 3.        ,  9.        ],
       [ 4.        , 22.35578423],
       [ 5.        , 25.        ],
       [ 6.        , 36.        ],
       [ 6.58808359, 49.        ],
       [ 8.        , 64.        ],
       [ 9.        , 81.        ],
       [10.        , 83.09539114]])

The replacements of the `SimpleImputer`, on the other hand, look as follows.  


In [13]:
SimpleImputer().fit_transform(data)

array([[ 1.   ,  1.   ],
       [ 5.75 ,  4.   ],
       [ 3.   ,  9.   ],
       [ 4.   , 33.625],
       [ 5.   , 25.   ],
       [ 6.   , 36.   ],
       [ 5.75 , 49.   ],
       [ 8.   , 64.   ],
       [ 9.   , 81.   ],
       [10.   , 33.625]])

__Brainstorming:__  
<details>
    <summary>What are the advantages of the multidimensional variant?</summary>
    A major advantage is that each missing value is estimated from the other features of the same row rather than from a single column statistic. This is often better, because features frequently depend on each other. Example: height and weight of individuals.
</details>


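The following three cells make the iterative nature of the algorithm visible: limiting `max_iter` to 1, 2, and 3 shows how the estimates change from round to round. With `max_iter=3`, the result already matches the output of the default configuration above.  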


In [14]:
IterativeImputer(max_iter=1).fit_transform(data)



array([[ 1.        ,  1.        ],
       [ 3.10315829,  4.        ],
       [ 3.        ,  9.        ],
       [ 4.        , 20.73213454],
       [ 5.        , 25.        ],
       [ 6.        , 36.        ],
       [ 6.8956479 , 49.        ],
       [ 8.        , 64.        ],
       [ 9.        , 81.        ],
       [10.        , 82.62527756]])

In [15]:
IterativeImputer(max_iter=2).fit_transform(data)



array([[ 1.        ,  1.        ],
       [ 2.34645245,  4.        ],
       [ 3.        ,  9.        ],
       [ 4.        , 22.28201783],
       [ 5.        , 25.        ],
       [ 6.        , 36.        ],
       [ 6.61040064, 49.        ],
       [ 8.        , 64.        ],
       [ 9.        , 81.        ],
       [10.        , 83.06934333]])

In [16]:
IterativeImputer(max_iter=3).fit_transform(data)

array([[ 1.        ,  1.        ],
       [ 2.3175117 ,  4.        ],
       [ 3.        ,  9.        ],
       [ 4.        , 22.35578423],
       [ 5.        , 25.        ],
       [ 6.        , 36.        ],
       [ 6.58808359, 49.        ],
       [ 8.        , 64.        ],
       [ 9.        , 81.        ],
       [10.        , 83.09539114]])
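
Because the dummy dataset was built from a known relationship, the true values behind the gaps are known, and the quality of both imputers can be compared directly. The following is a rough sketch: the helper `mean_abs_error` and the names `truth`, `missing`, and `mask` are introduced here for illustration only. It rebuilds the dataset from above and measures the mean absolute error on the four imputed cells.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

# Rebuild the dummy dataset, but keep a complete copy as ground truth
x = np.arange(1, 11, dtype="float")
truth = np.column_stack([x, x * x])

data = truth.copy()
missing = [(1, 0), (6, 0), (3, 1), (9, 1)]  # (row, column) of the removed values
for row, col in missing:
    data[row, col] = np.nan

mask = np.isnan(data)


def mean_abs_error(imputer):
    """Mean absolute deviation of the imputed cells from the true values."""
    filled = imputer.fit_transform(data)
    return np.abs(filled[mask] - truth[mask]).mean()


iterative_error = mean_abs_error(IterativeImputer())
simple_error = mean_abs_error(SimpleImputer())

print(f"IterativeImputer: {iterative_error:.2f}")
print(f"SimpleImputer:    {simple_error:.2f}")
```

On this dataset the iterative variant ends up much closer to the true values. Its default estimator is a linear model, though, so it cannot reproduce the quadratic relationship exactly.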

## Scaling

### Standardization
Standardization transforms the input data so that each feature has a mean of 0 and a variance of 1. This only shifts and rescales the values; the shape of the distribution stays the same, so the result is not automatically normally distributed. Standardization is particularly helpful when the features are on very different scales.  

__Example:__ A dataset contains the values height in meters and weight in kilograms of people. The variance of the variables will clearly differ, since height is on a scale from 0.1 m to 2.8 m, while weight is on a scale from 0.5 kg to 600 kg.  

In sklearn, the class [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler) is used for standardization.  


In [17]:
from sklearn.preprocessing import StandardScaler

scaling_data = np.array([
    [1.79, 79.5], 
    [1.60, 53.2], 
    [2.59, 150], 
    [1.73, 70.7], 
    [1.50, 46.7], 
    [1.75, 113.0], 
    [1.93, 247.2]
])
print(f"Mean per feature: {scaling_data.mean(axis=0)}")
print(f"Standard deviation per feature: {scaling_data.std(axis=0)}")

Mean per feature: [  1.84142857 108.61428571]
Standard deviation per feature: [ 0.33090476 65.60408151]


In [18]:
scaler = StandardScaler()
scaled_data = scaler.fit_transform(scaling_data)
scaled_data

array([[-0.15541805, -0.44378772],
       [-0.72960139, -0.84467741],
       [ 2.26219602,  0.63084054],
       [-0.3367391 , -0.57792572],
       [-1.03180315, -0.94375661],
       [-0.27629875,  0.06685124],
       [ 0.26766441,  2.11245568]])

In [19]:
print(f"Mean per feature: {scaled_data.mean(axis=0)}")
print(f"Standard deviation per feature: {scaled_data.std(axis=0)}")

Mean per feature: [-3.17206578e-17  6.34413157e-17]
Standard deviation per feature: [1. 1.]


__Brainstorming:__  
<details>
<summary>Why isn’t the mean at 0?</summary>
These are floating-point rounding errors that can be neglected. Values on the order of 1e-17 are effectively zero.
</details>
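
The transformation itself is short enough to write by hand: each value has the column mean subtracted and is then divided by the column standard deviation. The sketch below assumes the population standard deviation, which is NumPy's default (`ddof=0`), and compares the manual result with `StandardScaler`. The names `manual` and `sklearn_result` are chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

scaling_data = np.array([
    [1.79, 79.5],
    [1.60, 53.2],
    [2.59, 150],
    [1.73, 70.7],
    [1.50, 46.7],
    [1.75, 113.0],
    [1.93, 247.2],
])

# z = (x - mean) / std, computed separately for every column
column_mean = scaling_data.mean(axis=0)
column_std = scaling_data.std(axis=0)  # population standard deviation (ddof=0)
manual = (scaling_data - column_mean) / column_std

scaler = StandardScaler()
sklearn_result = scaler.fit_transform(scaling_data)

# The fitted scaler stores the same statistics in mean_ and scale_
print(np.allclose(manual, sklearn_result))
print(np.allclose(scaler.mean_, column_mean), np.allclose(scaler.scale_, column_std))
```

Comparing with `np.allclose` instead of `==` avoids false alarms caused by the rounding errors mentioned above.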


If the data should be scaled but not centered, pass the parameter `with_mean=False` to the constructor.  


In [20]:
not_center_scaler = StandardScaler(with_mean=False)
non_centric_data = not_center_scaler.fit_transform(scaling_data)

print(f"Mean per feature: {non_centric_data.mean(axis=0)}")
print(f"Standard deviation per feature: {non_centric_data.std(axis=0)}")
non_centric_data

Mean per feature: [5.56482953 1.65560257]
Standard deviation per feature: [1. 1.]


array([[5.40941148, 1.21181485],
       [4.83522814, 0.81092516],
       [7.82702555, 2.28644311],
       [5.22809043, 1.07767685],
       [4.53302638, 0.71184595],
       [5.28853078, 1.72245381],
       [5.83249394, 3.76805824]])

It is also possible to center the data without scaling it, which preserves the original variance. To do so, call the constructor with the parameter `with_std=False`.  


In [21]:
not_scaling_scaler = StandardScaler(with_std=False)
non_scaled_data = not_scaling_scaler.fit_transform(scaling_data)

print(f"Mean per feature: {non_scaled_data.mean(axis=0)}")
print(f"Standard deviation per feature: {non_scaled_data.std(axis=0)}")
non_scaled_data

Mean per feature: [0. 0.]
Standard deviation per feature: [ 0.33090476 65.60408151]


array([[-5.14285714e-02, -2.91142857e+01],
       [-2.41428571e-01, -5.54142857e+01],
       [ 7.48571429e-01,  4.13857143e+01],
       [-1.11428571e-01, -3.79142857e+01],
       [-3.41428571e-01, -6.19142857e+01],
       [-9.14285714e-02,  4.38571429e+00],
       [ 8.85714286e-02,  1.38585714e+02]])

### Scaling features to a specific range

Alternatively, features can be scaled so that they lie between a given minimum and maximum value, often between zero and one, or such that the maximum absolute value of each feature is scaled to unit size. This can be achieved with [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#) or [MaxAbsScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html#).  


In [22]:
from sklearn.preprocessing import MinMaxScaler
min_max_data = np.array([
    [-1, 2], 
    [-0.5, 6], 
    [0, 10], 
    [1, 18]
])

zero_one_scaler = MinMaxScaler()
zero_one_scaler.fit_transform(min_max_data)

array([[0.  , 0.  ],
       [0.25, 0.25],
       [0.5 , 0.5 ],
       [1.  , 1.  ]])

To change the range, the parameter `feature_range` can be specified in the constructor.  


In [23]:
minus_one_one_scaler = MinMaxScaler(feature_range=(-1, 1))
minus_one_one_scaler.fit_transform(min_max_data)

array([[-1. , -1. ],
       [-0.5, -0.5],
       [ 0. ,  0. ],
       [ 1. ,  1. ]])
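
The mapping behind `MinMaxScaler` can be reproduced in a few lines: each column is first brought to the range 0 to 1 using its minimum and maximum, and is then stretched and shifted to the target range. The following sketch uses the illustrative names `low`, `high`, and `manual` and checks the hand-written formula against the scaler.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

min_max_data = np.array([
    [-1, 2],
    [-0.5, 6],
    [0, 10],
    [1, 18],
])
low, high = -1, 1  # target range, as in feature_range=(-1, 1)

col_min = min_max_data.min(axis=0)
col_max = min_max_data.max(axis=0)

# Step 1: map every column to [0, 1]; step 2: stretch and shift to [low, high]
unit_range = (min_max_data - col_min) / (col_max - col_min)
manual = unit_range * (high - low) + low

sklearn_result = MinMaxScaler(feature_range=(low, high)).fit_transform(min_max_data)
print(np.allclose(manual, sklearn_result))
```

Both columns end up with identical values after scaling because their entries are spaced in the same proportions.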

In [24]:
from sklearn.preprocessing import MaxAbsScaler
max_abs_data = np.array([
    [ 1., -1.,  2.],
    [ 2.,  0.,  0.],
    [ 0.,  1., -1.]
])

max_abs_scaler = MaxAbsScaler()
max_abs_scaler.fit_transform(max_abs_data)

array([[ 0.5, -1. ,  1. ],
       [ 1. ,  0. ,  0. ],
       [ 0. ,  1. , -0.5]])
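
`MaxAbsScaler` is even simpler: every column is divided by its largest absolute value, so signs are kept and zeros stay zero. A minimal sketch of that computation follows (the name `manual` is chosen for illustration).

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

max_abs_data = np.array([
    [1., -1., 2.],
    [2., 0., 0.],
    [0., 1., -1.],
])

# Largest absolute value per column: [2., 1., 2.]
col_max_abs = np.abs(max_abs_data).max(axis=0)
manual = max_abs_data / col_max_abs

sklearn_result = MaxAbsScaler().fit_transform(max_abs_data)
print(np.allclose(manual, sklearn_result))
```

Since the data is not shifted, this scaler is a natural fit for data that is already centered at zero or that contains many zeros.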

### Problems with Outliers
The three previously introduced classes are not particularly good at handling outliers during scaling. This problem is illustrated [here](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#sphx-glr-auto-examples-preprocessing-plot-all-scaling-py).  

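A scaler that is robust to outliers is the [RobustScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html). It centers each feature on its median and scales it by a quantile range (by default the interquartile range); both statistics are barely affected by a few extreme values.  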


In [25]:
from sklearn.preprocessing import RobustScaler
outlier_data = np.array([
    [2, 4, 1, 27, 3, 4, 1, 3, 3, 2],
    [100, 92, 87, 94, 95, 83, 177, 84, 99, 89]
]).T

robust_scaler = RobustScaler(quantile_range=(25, 75))
robust_scaler.fit_transform(outlier_data)

array([[-0.57142857,  0.66666667],
       [ 0.57142857, -0.0952381 ],
       [-1.14285714, -0.57142857],
       [13.71428571,  0.0952381 ],
       [ 0.        ,  0.19047619],
       [ 0.57142857, -0.95238095],
       [-1.14285714,  8.        ],
       [ 0.        , -0.85714286],
       [ 0.        ,  0.57142857],
       [-0.57142857, -0.38095238]])

In [26]:
robust_scaler.scale_

array([ 1.75, 10.5 ])

In [27]:
robust_scaler.center_

array([ 3., 93.])

In comparison, the StandardScaler would scale the data as follows.  


In [28]:
StandardScaler().fit_transform(outlier_data)

array([[-0.40525742,  0.        ],
       [-0.13508581, -0.30477573],
       [-0.54034323, -0.49526056],
       [ 2.97188775, -0.2285818 ],
       [-0.27017161, -0.19048483],
       [-0.13508581, -0.64764842],
       [-0.54034323,  2.93346637],
       [-0.27017161, -0.60955145],
       [-0.27017161, -0.03809697],
       [-0.40525742, -0.41906662]])
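
One way to quantify the effect of the outlier is to check how much room the ordinary values still occupy after scaling. The sketch below looks only at the first feature, where 27 is the outlier. The helper `inlier_spread` and the threshold used to define the inliers are assumptions made for this illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# First feature of outlier_data as a single column; 27 is the outlier
first_feature = np.array([2, 4, 1, 27, 3, 4, 1, 3, 3, 2], dtype=float).reshape(-1, 1)
is_inlier = first_feature[:, 0] < 27


def inlier_spread(scaler):
    """Distance between the largest and smallest scaled inlier."""
    scaled = scaler.fit_transform(first_feature)[:, 0]
    inliers = scaled[is_inlier]
    return inliers.max() - inliers.min()


spreads = {
    "MinMaxScaler": inlier_spread(MinMaxScaler()),
    "StandardScaler": inlier_spread(StandardScaler()),
    "RobustScaler": inlier_spread(RobustScaler()),
}

for name, spread in spreads.items():
    print(f"{name}: {spread:.3f}")
```

With `MinMaxScaler`, the nine ordinary values are squeezed into roughly the lowest 12 % of the output range because the single outlier defines the maximum. `RobustScaler` keeps them clearly separated.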

__Exercise:__  
<details>
<summary>How are the new values calculated?</summary>
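First, the median and the lower and upper quantiles of each column must be determined (the 25th and 75th percentiles for `quantile_range=(25, 75)`). After fitting, the median is available as `center_` and the quantile range as `scale_`. Then the new values can be calculated as follows:  

$$x_{new} = \frac{x_{old} - \mathrm{median}}{q_{upper} - q_{lower}}$$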
</details>
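
The calculation from the exercise can be checked with plain NumPy. The following sketch uses `np.median` and `np.percentile` (with their default linear interpolation) to recompute the result by hand; the names `q_lower`, `q_upper`, and `manual` are chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

outlier_data = np.array([
    [2, 4, 1, 27, 3, 4, 1, 3, 3, 2],
    [100, 92, 87, 94, 95, 83, 177, 84, 99, 89],
]).T

# Per-column median and 25th / 75th percentiles
median = np.median(outlier_data, axis=0)
q_lower, q_upper = np.percentile(outlier_data, [25, 75], axis=0)

manual = (outlier_data - median) / (q_upper - q_lower)

sklearn_result = RobustScaler(quantile_range=(25, 75)).fit_transform(outlier_data)
print(median, q_upper - q_lower)
print(np.allclose(manual, sklearn_result))
```

The printed median and quantile range correspond to the `center_` and `scale_` attributes shown earlier.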


---

Lecture: AI I - Basics 

Exercise: [**Exercise 4.1: Data Preparation**](../04_ml/exercises/01_data_preparation.ipynb)

Next: [**Chapter 4.2: Machine Learning with scikit-learn**](../04_ml/02_machine_learning.ipynb)