# Data Transformation

In [23]:
import numpy as np
import pandas as pd

np.random.seed(101)
mat = np.random.randint(1,101,(100,5))
df = pd.DataFrame(mat)
print('Before:')
print(df)

Before:
      0   1   2   3    4
0    96  12  82  71   64
1    88  76  10  78   41
2     5  64  41  61   93
3    65   6  13  94   41
4    50  84   9  30   60
5    35  45  73  20   11
6    77  96  88   1   74
7     9  63  37  84  100
8    29  64   8  11   53
9    57  39  74  53   19
10   72  16  45   1   13
11   18  76  80  98   94
12   25  37  64  20   36
13   31  11  61  21   28
14    9  87  27  88   47
15   48  55  87  10   46
16    3  19  59  93   12
17   11  95  36  29    4
18   84  85  48  15   70
19   61  70  52   7   89
20   72  69  24  36   80
21   99  68  83  58   78
22   47   4  47  30   87
23   22  22  82  24   95
24   72  21  28  76    6
25   50  87  90  64   83
26   78   4  57  15   50
27   88  53  14  48   50
28   25  21  65  53   61
29   48  30  61  54   12
..  ...  ..  ..  ..  ...
70   53   8  41  74   87
71   15  50  98  26   58
72   41  18  33  84   98
73   28  48  14  71   16
74   93  19  95  49   66
75   83  35   6  47   84
76   28  27  21  88   85
77   18  60  65  

# Rescale Data

When your data consists of attributes with varying scales, many machine learning algorithms can benefit from rescaling the attributes so that they all share the same scale.

This is often referred to as normalization, and attributes are typically rescaled into the range between 0 and 1. This is useful for optimization algorithms used at the core of machine learning algorithms, such as gradient descent. It is also useful for algorithms that weight inputs, like regression and neural networks, and for algorithms that use distance measures, like K-Nearest Neighbors.

#### After rescaling you can see that all of the values are in the range between 0 and 1.


In [36]:
# Rescale data (between 0 and 1)
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
rescaled_data = scaler.fit_transform(df)

print('After Rescaling:')
np.set_printoptions(precision=3)
print(rescaled_data[0:5,:])

After Rescaling:
[[ 0.959  0.104  0.821  0.722  0.633]
 [ 0.876  0.771  0.063  0.794  0.398]
 [ 0.021  0.646  0.389  0.619  0.929]
 [ 0.639  0.042  0.095  0.959  0.398]
 [ 0.485  0.854  0.053  0.299  0.592]]
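
The rescaling shown above can also be reproduced by hand, which makes the underlying formula explicit: each column is shifted by its minimum and divided by its range, (x - min) / (max - min). The following is a minimal sketch on a small made-up array; the name `X_demo` and its values are assumptions for illustration and are not part of the dataset above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: two columns on very different scales
X_demo = np.array([[10.0, 200.0],
                   [20.0, 400.0],
                   [40.0, 300.0]])

# Column-wise minimum and maximum
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)

# Min-max formula applied per column
manual_rescaled = (X_demo - col_min) / (col_max - col_min)

# Same transformation via scikit-learn (default feature_range is (0, 1))
sk_rescaled = MinMaxScaler().fit_transform(X_demo)

print(manual_rescaled)
print(np.allclose(manual_rescaled, sk_rescaled))
```

Because the minimum and maximum are computed per column, every column ends up spanning exactly the interval from 0 to 1, regardless of its original scale.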


# Standardize Data

Standardization is a useful technique to transform attributes with a Gaussian distribution and differing means and standard deviations to a standard Gaussian distribution with a mean of 0 and a standard deviation of 1.

It is most suitable for techniques that assume a Gaussian distribution in the input variables and work better with rescaled data, such as linear regression, logistic regression and linear discriminant analysis.

#### The values for each attribute now have a mean value of 0 and a standard deviation of 1.


In [37]:
# Standardize data (0 mean, 1 stdev)
from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd

scaler = StandardScaler()
standardized_data = scaler.fit_transform(df)
print('After Standardization:')
np.set_printoptions(precision=3)
print(standardized_data[0:5,:])


After Standardization:
[[ 1.554 -1.403  1.041  0.667  0.303]
 [ 1.264  0.911 -1.554  0.91  -0.498]
 [-1.744  0.477 -0.437  0.32   1.314]
 [ 0.431 -1.62  -1.446  1.464 -0.498]
 [-0.113  1.2   -1.591 -0.754  0.164]]
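
As a cross-check of the output above, standardization can be written out directly: subtract each column's mean and divide by its standard deviation. The sketch below uses a small made-up array (`X_demo` is an assumed name, not part of the original notebook) and relies on the fact that `StandardScaler` uses the population standard deviation, which matches NumPy's default `ddof=0`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with different means and spreads per column
X_demo = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [3.0, 500.0],
                   [6.0, 700.0]])

# Column-wise mean and population standard deviation (ddof=0)
col_mean = X_demo.mean(axis=0)
col_std = X_demo.std(axis=0)

# z-score: (x - mean) / std
manual_standardized = (X_demo - col_mean) / col_std

# Same transformation via scikit-learn
sk_standardized = StandardScaler().fit_transform(X_demo)

print(np.allclose(manual_standardized, sk_standardized))
print(manual_standardized.mean(axis=0))  # approximately 0 per column
print(manual_standardized.std(axis=0))   # approximately 1 per column
```

Unlike min-max rescaling, the standardized values are not bounded to a fixed range; only their mean and standard deviation are fixed.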


# Normalize Data

Normalizing in scikit-learn refers to rescaling each observation (row) to have a length of 1 (called a unit norm in linear algebra).

This preprocessing can be useful for sparse datasets (lots of zeros) with attributes of varying scales when using algorithms that weight input values such as neural networks and algorithms that use distance measures such as K-Nearest Neighbors.

#### The rows are normalized to length 1.

In [41]:
# Normalize data (length of 1)
from sklearn.preprocessing import Normalizer

scaler = Normalizer().fit(df)
normalizedX = scaler.transform(df)
# summarize transformed data
np.set_printoptions(precision=3)
print('After Normalizing :')
print(normalizedX[0:5,:])


After Normalizing :
[[ 0.604  0.076  0.516  0.447  0.403]
 [ 0.602  0.52   0.068  0.533  0.28 ]
 [ 0.037  0.475  0.304  0.453  0.69 ]
 [ 0.532  0.049  0.106  0.769  0.335]
 [ 0.421  0.706  0.076  0.252  0.505]]
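
To make the row-wise nature of this transformation concrete, the sketch below divides each row by its Euclidean (L2) length and compares the result with `Normalizer`, whose default is `norm='l2'`. The array `X_demo` is a made-up example, not taken from the data above.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical toy data: rows 0 and 1 point in the same direction
X_demo = np.array([[3.0, 4.0],
                   [6.0, 8.0],
                   [1.0, 0.0]])

# Euclidean length of each row, kept as a column for broadcasting
row_norms = np.linalg.norm(X_demo, axis=1, keepdims=True)

# Divide every row by its own length
manual_normalized = X_demo / row_norms

# Same transformation via scikit-learn (default norm is 'l2')
sk_normalized = Normalizer().fit_transform(X_demo)

print(manual_normalized)
print(np.allclose(manual_normalized, sk_normalized))
print(np.linalg.norm(manual_normalized, axis=1))  # each row has length 1
```

Note that the first two rows become identical after normalization: the transformation keeps only the direction of each observation and discards its magnitude, which differs from the column-wise scalers in the previous sections.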


# Binarize Data (Make Binary)

You can transform your data using a binary threshold. All values above the threshold are marked 1 and all equal to or below are marked as 0.

This is called binarizing or thresholding your data. It can be useful when you have probabilities that you want to turn into crisp values. It is also useful in feature engineering, when you want to add new features that indicate something meaningful.

#### You can see that all values equal to or less than the threshold are marked 0 and all of those above the threshold are marked 1.


In [33]:
# binarization
from sklearn.preprocessing import Binarizer
import numpy as np
import pandas as pd

binarizer = Binarizer(threshold=50).fit(df)
binaryX = binarizer.transform(df)
# summarize transformed data
np.set_printoptions(precision=3)
print(binaryX[0:5,:])



[[1 0 1 1 1]
 [1 1 0 1 0]
 [0 1 0 1 1]
 [1 0 0 1 0]
 [0 1 0 0 1]]
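
The boundary behaviour described above (values equal to the threshold map to 0) can be checked with a plain comparison. This is a minimal sketch on made-up values chosen to sit just below, at, and just above the threshold; `X_demo` is an assumed name for illustration.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical toy data around the threshold of 50
X_demo = np.array([[49.0, 50.0, 51.0],
                   [0.0, 100.0, 50.0]])

threshold = 50

# Strictly greater than the threshold -> 1, otherwise -> 0
manual_binary = (X_demo > threshold).astype(int)

# Same transformation via scikit-learn
sk_binary = Binarizer(threshold=threshold).fit_transform(X_demo)

print(manual_binary)
print(np.array_equal(manual_binary, sk_binary))
```

The comparison is strict, so a value of exactly 50 is marked 0 in both versions.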


Source: https://machinelearningmastery.com/prepare-data-machine-learning-python-scikit-learn/