# Preprocessing

Raw data usually must be transformed into a state suitable for machine learning.  The scikit-learn preprocessing module provides many common transformations, most of them in just a couple of lines of code.

## Standardization

Standardization transforms each feature to have zero mean and unit variance, as expected by many machine learning algorithms.  To do so, we subtract the feature's mean from each datum and divide the result by the feature's standard deviation.

In [1]:
from sklearn import preprocessing
import numpy as np

Here's some example training data that we want to standardize.

In [2]:
X_train = np.array([[1.0, -1.0, 2.0],
                    [2.0, 0.0, 0.0],
                    [0.0, 1.0, -1.0]])

First, we fit a StandardScaler to our data, which computes the mean and standard deviation of each feature.

In [3]:
scaler = preprocessing.StandardScaler().fit(X_train)

In [4]:
scaler.mean_

array([1.        , 0.        , 0.33333333])

In [5]:
scaler.scale_

array([0.81649658, 0.81649658, 1.24721913])

Then, we create a new, scaled dataset!

In [6]:
X_scaled = scaler.transform(X_train)

In [7]:
X_scaled

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])
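As a sanity check, the same result can be reproduced by hand with NumPy. This is a minimal sketch and not a description of how StandardScaler is implemented internally; the name X_manual is introduced here for illustration only. It assumes the population standard deviation (ddof=0, NumPy's default), which matches the values of scale_ shown above.

```python
import numpy as np
from sklearn import preprocessing

X_train = np.array([[1.0, -1.0, 2.0],
                    [2.0, 0.0, 0.0],
                    [0.0, 1.0, -1.0]])

# Subtract the per-column mean, then divide by the per-column
# standard deviation (ddof=0 is NumPy's default).
X_manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)

X_scaled = preprocessing.StandardScaler().fit_transform(X_train)

# Compare the hand-computed result with the scaler's output.
print(np.allclose(X_manual, X_scaled))

# Per-column mean and standard deviation of the scaled data.
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

The column means should be zero and the column standard deviations one, up to floating-point rounding.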

Use this class early in your pipeline to ensure all downstream code is using data of the same scale.
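One way to place the scaler early in a pipeline is sketched below. The choice of the iris dataset and of LogisticRegression as the downstream estimator is an assumption made purely for illustration; any estimator could take its place.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The scaler is fit on the training split only; the same mean and scale
# are reused when the pipeline transforms the test split.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)

# Accuracy on the held-out split.
accuracy = pipe.score(X_test, y_test)
print(accuracy)
```

Wrapping the scaler in a pipeline keeps the test data out of the fitting step, so the scaling statistics come from the training data alone.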

## Non-linear Transformation

Some transforms are more than just subtraction and division.

### Mapping to a Uniform Distribution

We can map each feature onto a uniform distribution over the range [0, 1], so that each transformed value is approximately the quantile of the original datum.

In [8]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

To demonstrate, we load and split the iris dataset.

In [9]:
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

Next, we create a QuantileTransformer from the preprocessing module. 

In [10]:
quantile_transformer = preprocessing.QuantileTransformer(random_state=0, n_quantiles=len(X_train))

Finally, we create uniformly mapped datasets with our transformer.

In [11]:
X_train_trans = quantile_transformer.fit_transform(X_train)
X_test_trans = quantile_transformer.transform(X_test)

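NumPy's percentile function gives the landmarks of the first feature (sepal length) in the original training data.  After the transform, these landmarks map approximately to 0, 0.25, 0.5, 0.75, and 1.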

In [12]:
np.percentile(X_train[:, 0], [0, 25, 50, 75, 100]) 

array([4.3, 5.1, 5.8, 6.5, 7.9])
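The same check can be run on the transformed data. The sketch below repeats the earlier steps so that it runs on its own; the name train_landmarks is introduced here for illustration. The exact values depend on the split, so no output is quoted.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

quantile_transformer = preprocessing.QuantileTransformer(
    random_state=0, n_quantiles=len(X_train))
X_train_trans = quantile_transformer.fit_transform(X_train)

# Percentiles of the first feature after the transform; these should lie
# close to 0, 0.25, 0.5, 0.75 and 1 if the output is roughly uniform.
train_landmarks = np.percentile(X_train_trans[:, 0], [0, 25, 50, 75, 100])
print(train_landmarks)
```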

### Mapping to a Gaussian Distribution

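Many modeling techniques work best when features are approximately Gaussian.  Power transforms are parametric, monotonic transformations that map data toward a Gaussian distribution in order to stabilize variance and minimize skewness.  The PowerTransformer class offers the Yeo-Johnson and Box-Cox methods.  Yeo-Johnson works for negative as well as positive numbers, while Box-Cox requires strictly positive data.  Box-Cox is demonstrated below on samples drawn from a lognormal distribution, which are always positive.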

In [13]:
pt = preprocessing.PowerTransformer(method='box-cox', standardize=False)
X_lognormal = np.random.RandomState(616).lognormal(size=(3, 3))

In [14]:
X_lognormal

array([[1.28331718, 1.18092228, 0.84160269],
       [0.94293279, 1.60960836, 0.3879099 ],
       [1.35235668, 0.21715673, 1.09977091]])

In [15]:
pt.fit_transform(X_lognormal)

array([[ 0.49024349,  0.17881995, -0.1563781 ],
       [-0.05102892,  0.58863195, -0.57612415],
       [ 0.69420009, -0.84857822,  0.10051454]])
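For data that contains zeros or negative values, the Yeo-Johnson method is the one to reach for. The sketch below uses made-up normally distributed data (the names X_mixed, X_yj and box_cox_failed are illustrative, not from the original notebook) and also shows that fitting Box-Cox to the same data raises a ValueError.

```python
import numpy as np
from sklearn import preprocessing

# Illustrative data with both negative and positive values.
rng = np.random.RandomState(0)
X_mixed = rng.normal(size=(20, 2))

# Yeo-Johnson accepts negative values.
yj = preprocessing.PowerTransformer(method='yeo-johnson', standardize=False)
X_yj = yj.fit_transform(X_mixed)
print(X_yj[:3])

# Box-Cox is only defined for strictly positive data, so fitting fails.
try:
    preprocessing.PowerTransformer(method='box-cox').fit(X_mixed)
    box_cox_failed = False
except ValueError as err:
    box_cox_failed = True
    print(err)
```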

We can also use the QuantileTransformer class to map data onto a normal distribution by setting output_distribution='normal'.

In [16]:
quantile_transformer = preprocessing.QuantileTransformer(
    output_distribution='normal', random_state=0, n_quantiles=len(X))

In [17]:
X_trans = quantile_transformer.fit_transform(X)

In [18]:
quantile_transformer.quantiles_

array([[4.3, 2. , 1. , 0.1],
       [4.4, 2.2, 1.1, 0.1],
       [4.4, 2.2, 1.2, 0.1],
       [4.4, 2.2, 1.2, 0.1],
       [4.5, 2.3, 1.3, 0.1],
       [4.6, 2.3, 1.3, 0.2],
       [4.6, 2.3, 1.3, 0.2],
       [4.6, 2.3, 1.3, 0.2],
       [4.6, 2.4, 1.3, 0.2],
       [4.7, 2.4, 1.3, 0.2],
       [4.7, 2.4, 1.3, 0.2],
       [4.8, 2.5, 1.4, 0.2],
       [4.8, 2.5, 1.4, 0.2],
       [4.8, 2.5, 1.4, 0.2],
       [4.8, 2.5, 1.4, 0.2],
       [4.8, 2.5, 1.4, 0.2],
       [4.9, 2.5, 1.4, 0.2],
       [4.9, 2.5, 1.4, 0.2],
       [4.9, 2.5, 1.4, 0.2],
       [4.9, 2.6, 1.4, 0.2],
       [4.9, 2.6, 1.4, 0.2],
       [4.9, 2.6, 1.4, 0.2],
       [5. , 2.6, 1.4, 0.2],
       [5. , 2.6, 1.4, 0.2],
       [5. , 2.7, 1.5, 0.2],
       [5. , 2.7, 1.5, 0.2],
       [5. , 2.7, 1.5, 0.2],
       [5. , 2.7, 1.5, 0.2],
       [5. , 2.7, 1.5, 0.2],
       [5. , 2.7, 1.5, 0.2],
       [5. , 2.7, 1.5, 0.2],
       [5. , 2.7, 1.5, 0.2],
       [5.1, 2.7, 1.5, 0.2],
       [5.1, 2.8, 1.5, 0.2],
       ...

## Normalization

Scaling each sample to have unit norm is useful when a quadratic form such as the dot product or another kernel is used to quantify the similarity of pairs of samples.  The preprocessing module has a function, normalize, for just this purpose!

In [19]:
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]

In [20]:
X_normalized = preprocessing.normalize(X, norm='l2')

In [21]:
X_normalized

array([[ 0.40824829, -0.40824829,  0.81649658],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.        ,  0.70710678, -0.70710678]])
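To see what the l2 option does, the result can be reproduced by dividing each row by its Euclidean length. This is a minimal sketch for illustration; the names row_norms and X_manual are not part of the original notebook.

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# Divide each row (sample) by its Euclidean (L2) length.
row_norms = np.linalg.norm(X, axis=1, keepdims=True)
X_manual = X / row_norms

X_normalized = preprocessing.normalize(X, norm='l2')

# Compare the hand-computed result with normalize's output.
print(np.allclose(X_manual, X_normalized))

# Length of each normalized row.
print(np.linalg.norm(X_normalized, axis=1))
```

Once every row has unit length, the dot product of two rows equals their cosine similarity.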