## <center>Scaling</center>

Some algorithms, like neural networks and SVMs, are very sensitive to the scaling of the data. Therefore, a common practice is to adjust the features so that the data representation is more suitable for these algorithms. Often, this is a simple per-feature rescaling and shift of the data.

There are many scaling algorithms; the most widely used are:
- Standardization
- Normalization

### Standardization

Standardization, or Z-score normalization, transforms each feature by subtracting its mean and dividing by its standard deviation. The resulting value is often called the Z-score.

$$x_{new} = \frac{x-\mu}{\sigma}$$

The StandardScaler in scikit-learn ensures that for each feature the mean is 0 and the variance is 1, bringing all features to the same magnitude. However, this scaling does not ensure any particular minimum and maximum values for the features.
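The formula above can be checked by hand. The following is a minimal sketch, assuming a small made-up array `X` (not part of the original example), that compares a manual z-score computed with NumPy against `StandardScaler`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (made up for illustration): two features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0],
              [4.0, 700.0]])

# Manual z-score: subtract the per-feature mean, divide by the per-feature std
mu = X.mean(axis=0)
sigma = X.std(axis=0)  # population std (ddof=0), which StandardScaler also uses
X_manual = (X - mu) / sigma

# fit_transform() is fit() followed by transform() on the same data
X_sklearn = StandardScaler().fit_transform(X)

print(np.allclose(X_manual, X_sklearn))
print(X_sklearn.mean(axis=0), X_sklearn.std(axis=0))
```

Both approaches should agree up to floating-point error, with a per-feature mean of about 0 and a standard deviation of about 1.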

#### Example

In [1]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [2]:
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=1)

X_train[0]

array([1.522e+01, 3.062e+01, 1.034e+02, 7.169e+02, 1.048e-01, 2.087e-01,
       2.550e-01, 9.429e-02, 2.128e-01, 7.152e-02, 2.602e-01, 1.205e+00,
       2.362e+00, 2.265e+01, 4.625e-03, 4.844e-02, 7.359e-02, 1.608e-02,
       2.137e-02, 6.142e-03, 1.752e+01, 4.279e+01, 1.287e+02, 9.150e+02,
       1.417e-01, 7.917e-01, 1.170e+00, 2.356e-01, 4.089e-01, 1.409e-01])

As we can see, the features span different orders of magnitude, so a scaler is required.

In [3]:
scaler = StandardScaler()

scaler.fit(X_train) # The scaler has to be fitted only with the training data,
                    # if fitted with the test data the model will not be fairly evaluated

StandardScaler()

In [4]:
# transform data
X_train_scaled = scaler.transform(X_train)


# print dataset properties before and after scaling
print("transformed shape: {}".format(X_train_scaled.shape))
print("per-feature mean before scaling:\n {}".format(X_train.mean(axis=0)))
print("per-feature std before scaling:\n {}".format(X_train.std(axis=0)))

print("per-feature mean after scaling:\n {}".format(X_train_scaled.mean(axis=0))) # should be zeros
print("per-feature std after scaling:\n {}".format(X_train_scaled.std(axis=0))) # Should be ones

# fit() followed by transform() on the same data can be replaced by a single fit_transform() call

transformed shape: (426, 30)
per-feature mean before scaling:
 [1.41195047e+01 1.93320423e+01 9.19253991e+01 6.56126056e+02
 9.64633333e-02 1.04575516e-01 8.85219054e-02 4.88310070e-02
 1.80740845e-01 6.28127700e-02 4.11284742e-01 1.22736878e+00
 2.89862465e+00 4.14559061e+01 7.02072066e-03 2.58229225e-02
 3.18465390e-02 1.18348709e-02 2.04109296e-02 3.86226596e-03
 1.63016291e+01 2.56733568e+01 1.07488521e+02 8.89164085e+02
 1.32226502e-01 2.55249671e-01 2.70065559e-01 1.14623918e-01
 2.87710798e-01 8.38468075e-02]
per-feature std before scaling:
 [3.59928640e+00 4.34952000e+00 2.48120364e+01 3.61164531e+02
 1.37973673e-02 5.09370908e-02 7.95203274e-02 3.90652971e-02
 2.71231641e-02 6.77976825e-03 2.89636324e-01 5.83210517e-01
 2.09856063e+00 4.92091215e+01 3.08773296e-03 1.81004824e-02
 2.94992779e-02 6.34786633e-03 7.79220571e-03 2.83165677e-03
 4.98490876e+00 6.26863717e+00 3.45686879e+01 5.92364546e+02
 2.22082780e-02 1.54248636e-01 2.03768939e-01 6.66352413e-02
 5.76639437e-02 1.

In [5]:
# Now the test data can be transformed
X_test_scaled = scaler.transform(X_test)

print(X_test_scaled.mean(axis=0)) 
print(X_test_scaled.std(axis=0)) 

# Close to zero and one, but not exactly, because the test data is scaled with the mean and std of the training data

[ 0.0086086  -0.03878258  0.00699749 -0.01362775 -0.02971919 -0.01832081
  0.013881    0.00897744  0.06176414 -0.00889741 -0.08397601 -0.0717422
 -0.06174637 -0.0904677   0.02610589 -0.07579365  0.00636354 -0.02427944
  0.06708252 -0.09465667 -0.02589349  0.00245419 -0.02616429 -0.05763985
  0.02545833 -0.0253996   0.04145461 -0.00105661  0.16317791  0.02338157]
[0.91029049 0.95069544 0.91126505 0.88994655 1.07117801 1.13607535
 1.00638752 0.96940983 1.03724313 1.15261976 0.81050016 0.7555601
 0.83969644 0.63978344 0.88106901 0.9511862  1.08631385 0.8794542
 1.21973437 0.69477324 0.86860917 0.91620326 0.87952945 0.82943534
 1.1040482  1.07384599 1.08777938 0.94132584 1.25430308 1.25880753]


It's important to note that the y values do not need to be scaled!
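One way to make sure the scaler is fitted only on the training data is to wrap it in a pipeline together with the model. This is a minimal sketch, assuming an `SVC` classifier with default settings (the model choice is an assumption, not part of the original example); `y_train` and `y_test` are passed through unscaled:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=1)

# fit() learns the scaling statistics from X_train only, then trains the SVM
# on the scaled features; y_train is used as-is
model = make_pipeline(StandardScaler(), SVC())
model.fit(X_train, y_train)

# score() applies the already-fitted scaler to X_test before predicting
test_accuracy = model.score(X_test, y_test)
print("test accuracy: {:.3f}".format(test_accuracy))
```

With a pipeline, the test data never influences the scaling statistics, so the evaluation stays fair.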

### Normalization

Normalization (min-max scaling) transforms each feature by subtracting its minimum value and dividing by its range (maximum minus minimum).

$$x_{new} = \frac{x-x_{min}}{x_{max} - x_{min}}$$

This scales the range to [0, 1], or to [-1, 1] with a scaler such as MaxAbsScaler. Geometrically speaking, the transformation squishes the n-dimensional data into an n-dimensional unit hypercube. Normalization is useful when there are no outliers, as it cannot cope with them: a single extreme value compresses the remaining values into a narrow part of the range. For example, we would usually normalize age but not income, because only a few people have very high incomes, while age is closer to uniformly distributed.

The MinMaxScaler shifts and rescales the data so that, on the data it was fitted on, every feature lies exactly between 0 and 1. For a two-dimensional dataset this means all of the data is contained within the rectangle created by the x-axis between 0 and 1 and the y-axis between 0 and 1.
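The min-max formula can also be verified by hand. The following is a minimal sketch, assuming a small made-up array `X` (not part of the original example), that compares a manual computation with `MinMaxScaler`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data (made up for illustration): two features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0],
              [5.0, 900.0]])

# Manual min-max scaling: subtract the per-feature min, divide by the range
x_min = X.min(axis=0)
x_max = X.max(axis=0)
X_manual = (X - x_min) / (x_max - x_min)

X_sklearn = MinMaxScaler().fit_transform(X)

print(np.allclose(X_manual, X_sklearn))
print(X_sklearn.min(axis=0), X_sklearn.max(axis=0))
```

After scaling, each feature of the fitted data has a minimum of 0 and a maximum of 1.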

#### Example

In [6]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

In [7]:
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=1)

X_train[0]

array([1.522e+01, 3.062e+01, 1.034e+02, 7.169e+02, 1.048e-01, 2.087e-01,
       2.550e-01, 9.429e-02, 2.128e-01, 7.152e-02, 2.602e-01, 1.205e+00,
       2.362e+00, 2.265e+01, 4.625e-03, 4.844e-02, 7.359e-02, 1.608e-02,
       2.137e-02, 6.142e-03, 1.752e+01, 4.279e+01, 1.287e+02, 9.150e+02,
       1.417e-01, 7.917e-01, 1.170e+00, 2.356e-01, 4.089e-01, 1.409e-01])

As we can see, the features span different orders of magnitude, so a scaler is required.

In [8]:
scaler = MinMaxScaler()

scaler.fit(X_train) # The scaler has to be fitted only with the training data,
                    # if fitted with the test data the model will not be fairly evaluated

MinMaxScaler()

In [9]:
# transform data
X_train_scaled = scaler.transform(X_train)


# print dataset properties before and after scaling
print("transformed shape: {}".format(X_train_scaled.shape))
print("per-feature mean before scaling:\n {}".format(X_train.mean(axis=0)))
print("per-feature std before scaling:\n {}".format(X_train.std(axis=0)))

print("per-feature mean after scaling:\n {}".format(X_train_scaled.mean(axis=0)))
print("per-feature std after scaling:\n {}".format(X_train_scaled.std(axis=0)))

# fit() followed by transform() on the same data can be replaced by a single fit_transform() call

transformed shape: (426, 30)
per-feature mean before scaling:
 [1.41195047e+01 1.93320423e+01 9.19253991e+01 6.56126056e+02
 9.64633333e-02 1.04575516e-01 8.85219054e-02 4.88310070e-02
 1.80740845e-01 6.28127700e-02 4.11284742e-01 1.22736878e+00
 2.89862465e+00 4.14559061e+01 7.02072066e-03 2.58229225e-02
 3.18465390e-02 1.18348709e-02 2.04109296e-02 3.86226596e-03
 1.63016291e+01 2.56733568e+01 1.07488521e+02 8.89164085e+02
 1.32226502e-01 2.55249671e-01 2.70065559e-01 1.14623918e-01
 2.87710798e-01 8.38468075e-02]
per-feature std before scaling:
 [3.59928640e+00 4.34952000e+00 2.48120364e+01 3.61164531e+02
 1.37973673e-02 5.09370908e-02 7.95203274e-02 3.90652971e-02
 2.71231641e-02 6.77976825e-03 2.89636324e-01 5.83210517e-01
 2.09856063e+00 4.92091215e+01 3.08773296e-03 1.81004824e-02
 2.94992779e-02 6.34786633e-03 7.79220571e-03 2.83165677e-03
 4.98490876e+00 6.26863717e+00 3.45686879e+01 5.92364546e+02
 2.22082780e-02 1.54248636e-01 2.03768939e-01 6.66352413e-02
 5.76639437e-02 1.6

In [10]:
# Now the test data can be transformed
X_test_scaled = scaler.transform(X_test)

print(X_test_scaled.mean(axis=0)) 
print(X_test_scaled.std(axis=0)) 

[0.33931987 0.31969417 0.33383333 0.21535703 0.39201306 0.31521139
 0.20999467 0.24444191 0.38593982 0.27493842 0.09851044 0.18240098
 0.09480498 0.0564105  0.18317057 0.1667244  0.08089459 0.2212682
 0.2194613  0.09326005 0.29322492 0.36430547 0.27976521 0.16462354
 0.40693316 0.24603488 0.23804506 0.39365468 0.33393603 0.31085522]
[0.15506631 0.13984    0.15624588 0.13633812 0.13342454 0.21647603
 0.18750765 0.18822208 0.14208745 0.1717094  0.08512539 0.09738565
 0.08303039 0.05880332 0.09248074 0.1293067  0.08092291 0.10575218
 0.18305543 0.06796841 0.15403548 0.15307425 0.15142278 0.12075504
 0.16191646 0.18189925 0.18944927 0.21555146 0.17188228 0.22711156]
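To check the range directly, the per-feature minimum and maximum can be inspected instead of the mean and standard deviation. This is a minimal sketch that repeats the setup of the example; note that the test data is scaled with the training minimum and maximum, so its values may fall slightly outside [0, 1]:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=1)

# Fit on the training data only, then transform both sets
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# On the training data every feature spans exactly [0, 1]
print("train min:\n {}".format(X_train_scaled.min(axis=0)))
print("train max:\n {}".format(X_train_scaled.max(axis=0)))

# On the test data the range is only approximately [0, 1]
print("test min:\n {}".format(X_test_scaled.min(axis=0)))
print("test max:\n {}".format(X_test_scaled.max(axis=0)))
```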


### As we can see...

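Normalization and standardization work very similarly; the only difference is how they scale the data. Normalization maps each feature to a fixed range, while standardization centers each feature at zero with unit variance.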

#### Standardization vs. Normalization: When to Use Each

Typically we normalize data when performing some type of analysis in which we have multiple variables that are measured on different scales and we want each of the variables to have the same range. This prevents one variable from being overly influential, especially if it’s measured in different units (e.g. if one variable is measured in inches and another is measured in yards).


On the other hand, we typically standardize data when we’d like to know how many standard deviations each value in a dataset lies from the mean.

### Robust scaling

$$x_{new} = \frac{x-median}{IQR}$$

The RobustScaler works similarly to the StandardScaler in that it ensures statistical properties for each feature that guarantee that they are on the same scale. However, the RobustScaler uses the median and quartiles instead of the mean and variance. This makes the RobustScaler largely insensitive to data points that are very different from the rest (like measurement errors). These odd data points are also called outliers, and can lead to trouble for other scaling techniques.
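The formula can be reproduced by hand with NumPy. This is a minimal sketch, assuming a small illustrative array with two outliers and the default settings of `RobustScaler` (interquartile range between the 25th and 75th percentiles):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Small illustrative array; 999 plays the role of an outlier
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 999.0, 999.0]])

# Manual robust scaling: subtract the per-feature median, divide by the IQR
median = np.median(X, axis=0)
q25, q75 = np.percentile(X, [25, 75], axis=0)
iqr = q75 - q25
X_manual = (X - median) / iqr

X_sklearn = RobustScaler().fit_transform(X)

print(np.allclose(X_manual, X_sklearn))
```

Because the median and the quartiles barely move when a single extreme value is added, the non-outlier values stay on a comparable scale.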

#### Example

In [11]:
from sklearn.preprocessing import RobustScaler
import numpy as np

In [12]:
X_train = np.array([[1, 2, 3],
                    [4, 5, 6],
                    [7, 999, 999]])

As we can see, the data has outliers, so a robust scaler is appropriate.

In [13]:
scaler = RobustScaler()

scaler.fit(X_train) # The scaler has to be fitted only with the training data,
                    # if fitted with the test data the model will not be fairly evaluated

RobustScaler()

In [14]:
# transform data
X_train_scaled = scaler.transform(X_train)  # the scaler was already fitted above

X_train_scaled



array([[-1.        , -0.00601805, -0.0060241 ],
       [ 0.        ,  0.        ,  0.        ],
       [ 1.        ,  1.99398195,  1.9939759 ]])