In [1]:
import numpy as np
import sklearn.preprocessing

# A. Data outliers
An important aspect of data that we have to deal with is outliers. In general terms, an outlier is a data point that lies far away from most of the other data points. For example, if we had watermelons weighing 5, 4, 6, 7, and 20 pounds, the 20-pound watermelon would be an outlier.


The data scaling methods from the previous two chapters are both affected by outliers. Data standardization uses each feature's mean and standard deviation, while ranged scaling uses the maximum and minimum feature values, meaning that they're both susceptible to being skewed by outlier values.

We can robustly scale the data, i.e. avoid being affected by outliers, by using the data's median and interquartile range (IQR). Since the median and IQR are percentile measurements of the data (the 50th percentile for the median, the 25th to 75th percentiles for the IQR), they are largely unaffected by a few extreme values. For the scaling method, we subtract the median from each data value and then divide by the IQR.
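To make the contrast concrete, here is a minimal sketch that reuses the watermelon weights from the example above and compares the statistics with and without the 20-pound melon. The helper name `summarize` is only illustrative, and the percentiles use NumPy's default linear interpolation.

```python
import numpy as np

# Watermelon weights from the example above; 20 is the outlier.
weights = np.array([5, 4, 6, 7, 20], dtype=float)
no_outlier = weights[weights < 20]

def summarize(values):
    """Return the statistics used by the different scaling methods."""
    q1, q3 = np.percentile(values, [25, 75])
    return {
        "mean": values.mean(),
        "std": values.std(),
        "median": np.median(values),
        "iqr": q3 - q1,
    }

with_outlier = summarize(weights)
without_outlier = summarize(no_outlier)

for name in ("mean", "std", "median", "iqr"):
    print(f"{name:>6}: without={without_outlier[name]:.3f}  "
          f"with={with_outlier[name]:.3f}")
```

Adding the single outlier moves the mean from 5.5 to 8.4 and the standard deviation from about 1.1 to about 5.9, while the median only moves from 5.5 to 6 and the IQR from 1.5 to 2.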

# B. Robust scaling with scikit-learn
In scikit-learn, we perform robust scaling with the RobustScaler object from the sklearn.preprocessing module. It is another transformer object, with the same fit, transform, and fit_transform methods described in the previous chapter.

The code below shows how to use the RobustScaler.

Robust scaling is a method of scaling features in a dataset, similar to other scaling techniques like standardization or normalization. The goal of robust scaling is to center and scale the features, making them suitable for certain machine learning algorithms or statistical analyses.

The robust scaling process involves the following steps:

1. **Centering (median centering):** Subtract the feature's median from each of its values. This centers the data around the median instead of the mean, making the result less sensitive to outliers.

2. **Scaling (interquartile range scaling):** Divide each centered value by the feature's interquartile range (IQR), which is the distance between the first quartile (25th percentile) and the third quartile (75th percentile). This scaling is less influenced by extreme values (outliers) than standardization or normalization.

The formula for robust scaling for a single feature X is given by

$$X_{\text{robust scaled}} = \frac{X - \operatorname{median}(X)}{\operatorname{IQR}(X)}$$

Here, median(X) is the median of feature X, and IQR(X) is the interquartile range of X.

Robust scaling is particularly useful when dealing with datasets that contain outliers, as it reduces the impact of extreme values on the scaling process. It's commonly used in situations where the distribution of features may not be normal or when the presence of outliers can significantly affect the performance of a model.
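As a rough check of the formula, the sketch below applies the two steps by hand with NumPy to the watermelon weights from section A and compares the result with RobustScaler. It assumes RobustScaler's default settings (centering on the median and scaling by the 25th to 75th percentile range); the variable names are only illustrative.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Single feature: the watermelon weights from section A.
feature = np.array([5.0, 4.0, 6.0, 7.0, 20.0])

# Step 1: center on the median.
median = np.median(feature)

# Step 2: scale by the interquartile range (75th minus 25th percentile).
q1, q3 = np.percentile(feature, [25, 75])
iqr = q3 - q1

manual = (feature - median) / iqr
print(manual)

# RobustScaler expects a 2-D array of shape (n_samples, n_features).
sk_scaled = RobustScaler().fit_transform(feature.reshape(-1, 1)).ravel()
print(np.allclose(manual, sk_scaled))
```

With a median of 6 and an IQR of 2, the four typical melons land between -1 and 0.5, while the 20-pound outlier maps to 7 and remains clearly visible without distorting the scale of the other values.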

In [2]:
data = np.array([[ 1.2,  2.3],
                 [ 2.1,  4.2],
                 [-1.9,  3.1],
                 [-2.5,  2.5],
                 [ 0.8,  3. ],
                 [ 6.3,  2.1],
                 [-1.5,  2.7],
                 [ 1.4,  2.9],
                 [ 1.8,  3.2]])

print('{}\n'.format(repr(data)))

from sklearn.preprocessing import RobustScaler

# Learn each column's median and IQR, then scale the data in one step
robust_scaler = RobustScaler()
transformed = robust_scaler.fit_transform(data)
print('{}\n'.format(repr(transformed)))

array([[ 1.2,  2.3],
       [ 2.1,  4.2],
       [-1.9,  3.1],
       [-2.5,  2.5],
       [ 0.8,  3. ],
       [ 6.3,  2.1],
       [-1.5,  2.7],
       [ 1.4,  2.9],
       [ 1.8,  3.2]])

array([[ 0.        , -1.        ],
       [ 0.27272727,  2.16666667],
       [-0.93939394,  0.33333333],
       [-1.12121212, -0.66666667],
       [-0.12121212,  0.16666667],
       [ 1.54545455, -1.33333333],
       [-0.81818182, -0.33333333],
       [ 0.06060606,  0.        ],
       [ 0.18181818,  0.5       ]])
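The fitted scaler keeps the statistics it learned, which is what lets transform reuse them on other data. The sketch below, assuming the default RobustScaler settings, reads the per-column median and IQR from the `center_` and `scale_` attributes and applies them to a made-up row, `new_point`, that is not part of the original data.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

data = np.array([[ 1.2,  2.3],
                 [ 2.1,  4.2],
                 [-1.9,  3.1],
                 [-2.5,  2.5],
                 [ 0.8,  3. ],
                 [ 6.3,  2.1],
                 [-1.5,  2.7],
                 [ 1.4,  2.9],
                 [ 1.8,  3.2]])

# fit only learns the statistics; it does not change the data.
robust_scaler = RobustScaler()
robust_scaler.fit(data)

print(robust_scaler.center_)  # per-column median
print(robust_scaler.scale_)   # per-column IQR

# Hypothetical new row, chosen to equal the column medians,
# so it should be mapped to (approximately) zero in both columns.
new_point = np.array([[1.2, 2.9]])
scaled_point = robust_scaler.transform(new_point)
print(scaled_point)
```

For this data the medians are 1.2 and 2.9 and the IQRs are 3.3 and 0.6, which is why the row [1.2, 2.3] in the output above becomes [0, -1].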

