# Scaling

This notebook shows how to apply scaling correctly in data science. We use both the methods provided by `sklearn` and a manual calculation.

In [1]:
# Import dstools (absolute path required; adjust it to your system's settings)
import importlib.util  # the util submodule must be imported explicitly
import sys

path = '/dstools-master/dstools/__init__.py'
name = 'dstools'

spec = importlib.util.spec_from_file_location(name, path)
module = importlib.util.module_from_spec(spec)
sys.modules[spec.name] = module
spec.loader.exec_module(module)

In [4]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

from dstools.datasets import bodyfat
from dstools.tools import quality

## Standardization

In [5]:
# Load data and prepare for regression
df = bodyfat()
df.convert(unit="metric")
X, y = df.for_regression()



In [6]:
quality(X)

Dataframe has 252 rows and 13 columns.

0 column(s) with missing values.

12 column(s) with outliers.



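| | type | unique | missing_abs | missing_rel | outliers_abs | outliers_rel |
|---|---|---|---|---|---|---|
| Age | int64 | 51 | 0 | 0.0 | 0 | 0.0 |
| Weight | float64 | 197 | 0 | 0.0 | 2 | 0.79 |
| Height | float64 | 48 | 0 | 0.0 | 1 | 0.4 |
| Neck | float64 | 90 | 0 | 0.0 | 3 | 1.19 |
| Chest | float64 | 174 | 0 | 0.0 | 2 | 0.79 |
| Abdomen | float64 | 185 | 0 | 0.0 | 3 | 1.19 |
| Hip | float64 | 152 | 0 | 0.0 | 3 | 1.19 |
| Thigh | float64 | 139 | 0 | 0.0 | 4 | 1.59 |
| Knee | float64 | 90 | 0 | 0.0 | 3 | 1.19 |
| Ankle | float64 | 61 | 0 | 0.0 | 3 | 1.19 |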


In [7]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=147, test_size = 0.2, shuffle=True)

### StandardScaler

Using the built-in scalers in `sklearn` requires the following steps:

- `fit(train)`: computes the mean and standard deviation of the training data and stores them in the `StandardScaler` object
- `transform(train)`: applies the z-transformation to the training data using the stored mean and standard deviation
- `transform(test)`: applies the z-transformation to the test data using the same stored mean and standard deviation

For the training data there is a shorthand:

- `fit_transform(train)`: runs `fit` and `transform` one after the other.

> ***WARNING***: **Never** call `fit_transform` on the **test** data! Doing so would rescale the test data with its own statistics instead of those of the training data.
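The steps above can be illustrated with a minimal sketch. It uses synthetic data rather than the bodyfat data set; the arrays `train` and `test`, the seed, and the distribution parameters are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): one feature, train and test from the same distribution
rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(200, 1))
test = rng.normal(loc=50.0, scale=10.0, size=(50, 1))

scaler = StandardScaler()
train_z = scaler.fit_transform(train)  # fit and transform on the training data
test_z = scaler.transform(test)        # reuse the stored training mean and std

# Wrong: a second fit on the test data uses the test set's own statistics
test_z_wrong = StandardScaler().fit_transform(test)

print(test_z.mean(), test_z.std())              # near 0 and 1, but not exact
print(test_z_wrong.mean(), test_z_wrong.std())  # 0 and 1 up to rounding
```

The correctly scaled test data will generally not have mean 0 and standard deviation 1. That is expected: the model must see test data on the same scale it was trained on.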

In [8]:
# Create a StandardScaler object
ssc = StandardScaler()

In [9]:
# Show attributes
ssc.__dict__

{'with_mean': True, 'with_std': True, 'copy': True}

In [10]:
# Apply fit method
ssc.fit(X_train)

In [11]:
# Show attributes
ssc.__dict__

{'with_mean': True,
 'with_std': True,
 'copy': True,
 'feature_names_in_': array(['Age', 'Weight', 'Height', 'Neck', 'Chest', 'Abdomen', 'Hip',
        'Thigh', 'Knee', 'Ankle', 'Biceps', 'Forearm', 'Wrist'],
       dtype=object),
 'n_features_in_': 13,
 'n_samples_seen_': np.int64(201),
 'mean_': array([ 44.62189055,  81.24945274, 178.03935323,  38.04626866,
        100.82363184,  92.52761194,  99.90791045,  59.49432836,
         38.59925373,  23.13507463,  32.37791045,  28.6419403 ,
         18.23139303]),
 'var_': array([160.23514269, 178.91171065,  94.90657719,   5.632101  ,
         66.64389178, 108.72998634,  52.82937971,  27.91957679,
          5.40402581,   3.01287276,   8.96177573,   3.9950226 ,
          0.84594035]),
 'scale_': array([12.65840206, 13.37578823,  9.74200068,  2.3732048 ,  8.16357102,
        10.42736718,  7.26838219,  5.28389788,  2.32465606,  1.73576287,
         2.99362251,  1.99875526,  0.91975016])}

In [12]:
# Apply Z-Transformation to training data
X_train_z = ssc.transform(X_train)
print(X_train_z[0:5,:])

[[-7.60118892e-01 -1.80994588e+00 -1.19783950e+00 -1.70076711e+00
  -1.22662396e+00 -1.67996500e+00 -1.47321786e+00 -1.79873430e+00
  -1.63432939e+00 -6.53934154e-01 -2.53469181e+00 -1.36682082e+00
  -1.45843197e+00]
 [ 1.37285175e+00 -3.93954558e-01  3.66520891e-01 -1.06871040e+00
  -3.93655158e-01 -9.95085262e-02 -1.93703414e-01 -5.49656414e-01
   4.62273488e-03 -4.23487930e-01 -2.93260237e-01 -6.66384886e-01
   3.89896064e-01]
 [-4.91286771e-02  8.77746198e-01 -1.05840019e+01 -6.09415866e-01
   6.32856399e-01  1.12803049e+00  2.14381813e+00  2.10368783e+00
   1.67368684e+00  3.25462298e-01  4.08231014e-01  2.90479293e-02
  -9.03933563e-01]
 [-1.28127590e-01  7.25231822e-01  3.66520891e-01 -6.16333900e-02
   7.79850894e-01  1.01390772e+00  7.70747795e-01  1.76303022e+00
  -1.28730325e-01  3.25462298e-01 -8.94937310e-02  1.34113319e-01
   4.98621242e-01]
 [-2.86125415e-01  1.98907699e-01 -4.80327747e-01 -1.94962764e-02
   3.15593281e-01  8.30735881e-01  4.39174698e-01  3.77689291e-01


In [13]:
# Check if mean = 0 and std = 1
print(np.mean(X_train_z, axis=0))
print(np.std(X_train_z, axis=0))

[-3.53503849e-17 -7.37939284e-16  8.83759622e-16  1.52779945e-15
 -2.42592016e-15  7.95383659e-16  1.64379290e-15 -1.04283635e-15
  8.17477650e-16 -1.02957996e-15  2.01497194e-15  1.11574652e-15
 -1.49134436e-16]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]


In [14]:
# Apply Z-Transformation to test data
X_test_z = ssc.transform(X_test)
print(X_test_z[0:5,:])

[[-0.04912868 -0.44479268 -1.1978395   0.43979826 -0.02616892  0.13161405
   0.0264281  -0.11247915 -0.43415185 -0.99384233  0.24120929  0.42429392
  -1.01265874]
 [ 2.32083871  0.9622272  -0.02456921  1.15612919  1.41805199  1.53273475
   0.99087931 -0.03488492  1.5446355   0.84972746  0.44497579  0.67945272
   2.90144769]
 [ 0.34586589 -0.25041162  0.36652089 -1.0687104  -0.37160598 -0.23377061
  -0.56379953 -0.47206218  0.04763985  0.03164336 -1.62609361 -1.07664022
  -0.68648321]
 [-1.31311128 -0.94495012 -0.61069111 -1.49850896 -1.30134617 -1.55625209
  -0.56379953 -0.20899881 -1.32890787 -0.12966899 -0.43021805 -0.32617315
  -0.68648321]
 [ 1.60984849 -1.75088394 -1.13214458 -1.41002102 -0.95958396 -1.22922802
  -1.69334938 -1.6643638  -2.23656902 -1.75431487 -1.2953906  -1.92716956
  -1.87158765]]


In [15]:
# Check mean and std: only approximately 0 and 1, because the scaler was fitted on the training data
print(np.mean(X_test_z, axis=0))
print(np.std(X_test_z, axis=0))

[ 0.10267316 -0.03361662  0.0715378  -0.11252841  0.00051586  0.01339197
 -0.00251811 -0.08245824 -0.01933189 -0.09419835 -0.17162912  0.0541616
 -0.01260028]
[0.9634595  0.97284499 0.73718154 1.09992058 1.14352689 1.14968976
 0.9169517  0.95483484 1.16453277 0.86406517 1.02488098 1.04215604
 1.06540557]


### Manual calculation

Note that the manual calculation gives different results with `numpy` and `pandas`, because the two libraries use different default definitions of the standard deviation:

In [16]:
# Training data with numpy (np.std defaults to ddof=0)
X_train_z_man = (np.array(X_train) - np.mean(np.array(X_train), axis=0)) / np.std(np.array(X_train), axis=0)

In [17]:
# Training data with pandas (DataFrame.std defaults to ddof=1)
X_train_z_man_pd = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)

In [18]:
X_train_z_man_pd.mean(axis=0)

Age       -3.976918e-17
Weight    -7.202641e-16
Height     8.660844e-16
Neck       1.560940e-15
Chest     -2.425920e-15
Abdomen    7.953837e-16
Hip        1.643793e-15
Thigh     -1.007486e-15
Knee       8.042213e-16
Ankle     -1.033999e-15
Biceps     2.021600e-15
Forearm    1.102490e-15
Wrist     -1.402968e-16
dtype: float64

In [19]:
X_train_z_man_pd.std(axis=0)

Age        1.0
Weight     1.0
Height     1.0
Neck       1.0
Chest      1.0
Abdomen    1.0
Hip        1.0
Thigh      1.0
Knee       1.0
Ankle      1.0
Biceps     1.0
Forearm    1.0
Wrist      1.0
dtype: float64

In [83]:
X_train_z_man

array([[ -0.76011889,  -1.80994588,  -1.1978395 , ...,  -2.53469181,
         -1.36682082,  -1.45843197],
       [  1.37285175,  -0.39395456,   0.36652089, ...,  -0.29326024,
         -0.66638489,   0.38989606],
       [ -0.04912868,   0.8777462 , -10.58400185, ...,   0.40823101,
          0.02904793,  -0.90393356],
       ...,
       [ -0.99711563,  -1.03017875,  -0.02456921, ...,  -0.86447454,
         -0.97157483,  -1.24098162],
       [  1.0568561 ,  -0.12256868,   0.36652089, ...,  -0.49034587,
          0.47432506,  -0.24070997],
       [  0.4248648 ,  -0.21602112,   0.75761099, ...,  -1.26198625,
         -0.97157483,  -0.24070997]])

In [84]:
X_train_z_man_pd

| | Age | Weight | Height | Neck | Chest | Abdomen | Hip | Thigh | Knee | Ankle | Biceps | Forearm | Wrist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 171 | -0.758226 | -1.805438 | -1.194856 | -1.696531 | -1.223569 | -1.675781 | -1.469549 | -1.794254 | -1.630259 | -0.652305 | -2.528379 | -1.363417 | -1.454800 |
| 70 | 1.369432 | -0.392973 | 0.365608 | -1.066049 | -0.392675 | -0.099261 | -0.193221 | -0.548287 | 0.004611 | -0.422433 | -0.292530 | -0.664725 | 0.388925 |
| 41 | -0.049006 | 0.875560 | -10.557641 | -0.607898 | 0.631280 | 1.125221 | 2.138479 | 2.098448 | 1.669518 | 0.324652 | 0.407214 | 0.028976 | -0.901682 |
| 202 | -0.127808 | 0.723426 | 0.365608 | -0.061480 | 0.777909 | 1.011382 | 0.768828 | 1.758639 | -0.128410 | 0.324652 | -0.089271 | 0.133779 | 0.497379 |
| 189 | -0.285413 | 0.198412 | -0.479131 | -0.019448 | 0.314807 | 0.828667 | 0.438081 | 0.376749 | 0.776991 | -0.129346 | 0.340572 | 0.283499 | 0.280471 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 30 | -0.994632 | 0.096989 | 0.951294 | 0.278981 | -0.038323 | -0.366159 | -0.014810 | -0.374607 | 0.047521 | 6.192141 | 0.044014 | -0.465099 | 0.172016 |
| 217 | 0.502609 | -0.832970 | -0.024508 | -0.477598 | -0.920537 | -1.053974 | -0.754531 | -0.903199 | 0.167669 | -0.301750 | -1.622044 | -1.363417 | 0.388925 |
| 23 | -0.994632 | -1.027613 | -0.024508 | -1.066049 | -1.726993 | -1.197468 | -0.893143 | -0.869218 | -1.029519 | -0.594837 | -0.862321 | -0.969155 | -1.237891 |
| 232 | 1.054224 | -0.122263 | 0.365608 | -0.019448 | -0.076202 | -0.424514 | -0.290661 | -0.452008 | 0.124759 | 0.267184 | -0.489125 | 0.473144 | -0.240110 |


In [21]:
X_train['Age'].std(ddof=0)

np.float64(12.658402059284093)

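Reason:

By default, `numpy` computes the population standard deviation (`ddof=0`):

$$
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^N(x_i - \mu)^2}
$$

whereas `pandas` computes the sample standard deviation (`ddof=1`):

$$
s = \sqrt{\frac{1}{N-1}\sum_{i=1}^N(x_i - \bar{x})^2}
$$

`StandardScaler` uses the first formula: its `scale_` value for `Age` equals `X_train['Age'].std(ddof=0)`.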
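The difference between the two definitions can be checked on a small example. The following is a sketch with made-up toy values (an assumption, not taken from the bodyfat data); it relies only on the `ddof` parameter that both `numpy.std` and `pandas.Series.std` accept.

```python
import numpy as np
import pandas as pd

# Toy values (assumption): mean 5, sum of squared deviations 32
x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(x)

sigma = np.std(x.to_numpy())  # numpy default ddof=0: divides by N
s = x.std()                   # pandas default ddof=1: divides by N - 1

print(sigma)  # 2.0
print(s)      # larger than sigma, since the divisor is smaller

# Passing ddof explicitly makes both libraries agree
print(np.isclose(x.std(ddof=0), sigma))
print(np.isclose(np.std(x.to_numpy(), ddof=1), s))

# The two estimates differ only by the factor sqrt((N - 1) / N)
print(np.isclose(s * np.sqrt((n - 1) / n), sigma))
```

For large `N` the factor is close to 1, so the two z-transformations differ only slightly, as the manually computed values above show.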


In [None]:
np.array(X_train['Age']).std()

In [None]:
# Check if mean = 0 and std = 1
print(np.mean(X_train_z_man, axis=0))
print(np.std(X_train_z_man, axis=0))

In [None]:
# Test data: scale with the training mean and std; ddof=0 matches StandardScaler
X_test_z_man = (X_test - X_train.mean()) / X_train.std(ddof=0)