# Data Preprocessing with sklearn

This notebook summarizes the key data preprocessing techniques. Sklearn's preprocessing library provides a solid foundation for this important task in the data science pipeline.

## Outline

* Missing values
* Polynomial features
* Categorical features
* Numerical features
* Custom transformations
* Feature scaling
* Normalization

## Missing values

First of all, it is important to identify the missing values and decide which value should replace them. The short answer is that the decision should partially depend on how random the missing values are.

If they are completely at random, they don’t give any extra information and can be omitted. On the other hand, if they’re not at random, the fact that a value is missing is itself information and can be expressed as an extra binary feature.
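
As a minimal sketch of the "extra binary feature" idea, the snippet below records missingness in an indicator column before imputing. The frame and the column names `income` and `income_missing` are hypothetical and only serve as an illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column name "income" is an assumption.
df = pd.DataFrame({"income": [52_000.0, np.nan, 61_000.0, np.nan]})

# 1 where the value was missing, 0 otherwise, recorded before any imputation.
df["income_missing"] = df["income"].isna().astype(int)

# Fill the gaps afterwards; the indicator column keeps the "was missing" signal.
df["income"] = df["income"].fillna(df["income"].median())
print(df)
```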

In [50]:
import numpy as np
import pandas as pd

# Example Missing Data
data = np.array([5,7,8, np.nan, np.nan, np.nan, -5, 0,25,999,1,-1, np.nan, 0, np.nan]).reshape((5,3))
data = pd.DataFrame(data, columns = ['f1', 'f2', 'f3']) #feature 1, feature 2, feature 3
data

      f1   f2    f3
0    5.0  7.0   8.0
1    NaN  NaN   NaN
2   -5.0  0.0  25.0
3  999.0  1.0  -1.0
4    NaN  0.0   NaN


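Rows or columns with too many non-meaningful missing values can be deleted from your data with the `dropna` method. The remaining rows keep their original index labels, so calling `reset_index` afterwards is usually a good idea (pass `drop=True` to discard the old labels instead of keeping them as a column).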

* axis: 0 for rows, 1 for columns
* thresh: the minimum number of non-NaN values a row or column needs in order to be kept
* inplace: if True, modify the frame in place instead of returning a new one

In [51]:
data.dropna()

      f1   f2    f3
0    5.0  7.0   8.0
2   -5.0  0.0  25.0
3  999.0  1.0  -1.0


In [52]:
data.dropna(thresh=1)

      f1   f2    f3
0    5.0  7.0   8.0
2   -5.0  0.0  25.0
3  999.0  1.0  -1.0
4    NaN  0.0   NaN


In [53]:
data.dropna().reset_index()

   index     f1   f2    f3
0      0    5.0  7.0   8.0
1      2   -5.0  0.0  25.0
2      3  999.0  1.0  -1.0


## Imputing values

To fill in missing values, pandas provides the `replace` and `fillna` methods. Four common imputation strategies are the mean, the most frequent value, the median and a constant.

pandas also provides options to fill forward (`ffill`) or fill backward (`bfill`), which are convenient when working with time series.
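
A small sketch of forward and backward filling on a time series. The dates and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with gaps.
idx = pd.date_range("2021-01-01", periods=5, freq="D")
ts = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan], index=idx)

forward = ts.ffill()   # carry the last observed value forward
backward = ts.bfill()  # pull the next observed value backward

print(pd.DataFrame({"raw": ts, "ffill": forward, "bfill": backward}))
```

Note that `bfill` leaves the last value missing here, because there is no later observation to pull from.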

In [54]:
data.replace([999.0,0], np.nan)

    f1   f2    f3
0  5.0  7.0   8.0
1  NaN  NaN   NaN
2 -5.0  NaN  25.0
3  NaN  1.0  -1.0
4  NaN  NaN   NaN


In [55]:
data.fillna(data.mean())

      f1   f2         f3
0    5.0  7.0   8.000000
1  333.0  2.0  10.666667
2   -5.0  0.0  25.000000
3  999.0  1.0  -1.000000
4  333.0  0.0  10.666667
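
The four strategy names mentioned earlier match the `strategy` options of sklearn's `SimpleImputer`. The sketch below is an alternative to the pandas calls used in this notebook, not a replacement for them, and the small array is invented for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with one gap per column.
X = np.array([[4.0, 6.0],
              [np.nan, 0.0],
              [4.0, 0.0],
              [10.0, np.nan]])

mean_imp = SimpleImputer(strategy="mean").fit_transform(X)
median_imp = SimpleImputer(strategy="median").fit_transform(X)
mode_imp = SimpleImputer(strategy="most_frequent").fit_transform(X)
const_imp = SimpleImputer(strategy="constant", fill_value=-1.0).fit_transform(X)

# Show what each strategy wrote into the two gaps: X[1, 0] and X[3, 1].
for name, out in [("mean", mean_imp), ("median", median_imp),
                  ("most_frequent", mode_imp), ("constant", const_imp)]:
    print(name, out[1, 0], out[3, 1])
```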


Other popular ways to impute missing data are filling each gap from the k nearest neighbors (KNN) of the sample or interpolating the values with one of a wide range of interpolation methods.
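
Both ideas can be sketched briefly, assuming sklearn's `KNNImputer` for the neighbor-based fill and pandas' `interpolate` for linear interpolation. The toy values are invented.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical data: the second row is missing its second feature.
X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 6.0],
              [8.0, 16.0]])

# The gap is filled with the average of that feature over the 2 nearest rows.
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)
print(X_knn)

# Linear interpolation between the neighboring observed values of a series.
s = pd.Series([1.0, np.nan, np.nan, 4.0])
s_interp = s.interpolate()
print(s_interp.tolist())
```

Here the two nearest rows (by the first feature) are `[1, 2]` and `[3, 6]`, so the gap is filled with their average for the second feature.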

## Polynomial features

Creating polynomial features is a simple and common way of feature engineering that adds complexity to numeric input data by combining features. Polynomial features are often created when we want to include the notion that there exists a nonlinear relationship between the features and the target. They are mostly used to add complexity to linear models with few features, or when we suspect the effect of one feature is dependent on another feature.

If, for example, you replace all the missing values by 0, all the cross-products using this feature will be 0. Moreover, if you don’t replace missing values (NaN), creating polynomial features will raise a `ValueError`. Therefore, replacing missing values by the median or the mean seems to be a reasonable choice.

Sklearn provides a `PolynomialFeatures` class to create polynomial features from scratch. The `degree` parameter determines the maximum degree of the polynomial. For example, when `degree` is set to two for the inputs $x_1$ and $x_2$, the features created will be 1, $x_1$, $x_2$, $x_1^2$, $x_1x_2$ and $x_2^2$. The `interaction_only` parameter lets the function know we only want the interaction features, i.e. 1, $x_1$, $x_2$ and $x_1x_2$.
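
A quick sketch of the degree-two case on a single made-up sample with $x_1 = 2$ and $x_2 = 3$, showing the difference `interaction_only` makes:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])  # one sample: x1 = 2, x2 = 3

# Full degree-2 expansion: 1, x1, x2, x1^2, x1*x2, x2^2
X_full = PolynomialFeatures(degree=2).fit_transform(X)
print(X_full)   # [[1. 2. 3. 4. 6. 9.]]

# Interaction terms only: 1, x1, x2, x1*x2 (no squared terms)
X_inter = PolynomialFeatures(degree=2, interaction_only=True).fit_transform(X)
print(X_inter)  # [[1. 2. 3. 6.]]
```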

The following example replaces 999.0 and `NaN` with the column mean, then creates interaction-only polynomial features up to the third degree.

In [56]:
from sklearn.preprocessing import PolynomialFeatures

# data.mean() is computed on the original frame, so the 999.0 still counts toward the f1 mean
data_clean = data.replace(999.0, np.nan).fillna(data.mean())
print(data_clean)

poly = PolynomialFeatures(degree=3, interaction_only=True) # only create interaction terms

polynomials = pd.DataFrame(poly.fit_transform(data_clean))
polynomials

      f1   f2         f3
0    5.0  7.0   8.000000
1  333.0  2.0  10.666667
2   -5.0  0.0  25.000000
3  333.0  1.0  -1.000000
4  333.0  0.0  10.666667


     0      1    2          3      4       5          6       7
0  1.0    5.0  7.0   8.000000   35.0    40.0  56.000000   280.0
1  1.0  333.0  2.0  10.666667  666.0  3552.0  21.333333  7104.0
2  1.0   -5.0  0.0  25.000000   -0.0  -125.0   0.000000    -0.0
3  1.0  333.0  1.0  -1.000000  333.0  -333.0  -1.000000  -333.0
4  1.0  333.0  0.0  10.666667    0.0  3552.0   0.000000     0.0


Just as with any other form of feature engineering, it is important to create polynomial features before doing any feature scaling.

## Categorical features

Munging categorical data is another essential process during data preprocessing. Even for tree-based models, it is necessary to convert categorical features to a numerical representation.

Before you start transforming your data, it is important to figure out if the feature you’re working on is ordinal (as opposed to nominal). An ordinal feature is best described as a feature with natural, ordered categories where the distances between the categories are not known.

In [57]:
x = np.array(['M', 'O', 'medium', 'M', 'O', 'high','F', 'A', 'high', 'F', 'AB', 'low','F', 'B', np.nan]).reshape(5,3)

data2 = pd.DataFrame(x, columns = ['sex', 'blood_type', 'edu_level'])
data2

  sex blood_type edu_level
0   M          O    medium
1   M          O      high
2   F          A      high
3   F         AB       low
4   F          B       nan


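The most popular way to encode nominal features is one-hot encoding. Essentially, each categorical feature with n categories is transformed into n binary features. The main parameters of `OneHotEncoder` are:

* categories: the unique values per feature, either inferred from the training data (`'auto'`) or passed in as a custom list of categories
* sparse: return a sparse matrix if True, else a dense array (renamed `sparse_output` in scikit-learn 1.2)
* dtype: the desired dtype of the output, here an integer type
* handle_unknown: set to `'ignore'` to encode categories not seen during fitting as all zeros instead of raising an error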

In [58]:
from sklearn.preprocessing import OneHotEncoder

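# dtype=int replaces the removed np.int alias; sparse output is the default
enc = OneHotEncoder(dtype=int, handle_unknown='ignore')
# get_feature_names_out() replaces get_feature_names() (removed in scikit-learn 1.2)
pd.DataFrame(enc.fit_transform(data2.to_numpy()).toarray(), columns=enc.get_feature_names_out())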

   x0_F  x0_M  x1_A  x1_AB  x1_B  x1_O  x2_high  x2_low  x2_medium  x2_nan
0     0     1     0      0     0     1        0       0          1       0
1     0     1     0      0     0     1        1       0          0       0
2     1     0     1      0     0     0        1       0          0       0
3     1     0     0      1     0     0        0       1          0       0
4     1     0     0      0     1     0        0       0          0       1
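
To illustrate what `handle_unknown='ignore'` does, here is a small sketch with a made-up blood type column: a category that was not present when fitting is encoded as a row of zeros instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit on three known categories (sorted internally as A, B, O).
train = np.array([["A"], ["B"], ["O"]])
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)

# "AB" was never seen during fit, so its row is all zeros.
encoded = enc.transform(np.array([["B"], ["AB"]])).toarray()
print(encoded)
```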


The `factorize` method, combined with an ordered `Categorical`, provides an alternative that can handle missing values and respects the order of our values.

The result is numerical and still ordered, and the missing values are encoded as 0. Note that replacing missing values with the smallest value might not always be the best choice. Another option is to put them in the most common category.

In [60]:
cat = pd.Categorical(data2.edu_level, categories=['missing', 'low', 'medium', 'high'], ordered=True)
cat = cat.fillna('missing')
print(cat)
labels, unique = pd.factorize(cat, sort=True)
data2['edu_lvl'] = labels
data2

[medium, high, high, low, missing]
Categories (4, object): [missing < low < medium < high]


  sex blood_type edu_level  edu_lvl
0   M          O    medium        2
1   M          O      high        3
2   F          A      high        3
3   F         AB       low        1
4   F          B       nan        0
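
As an alternative sketch (not the approach used in this notebook), sklearn's `OrdinalEncoder` can produce the same codes when the category order is passed explicitly. The missing entry is assumed to have been replaced by the string `'missing'` beforehand.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Education levels with the gap already replaced by the label "missing".
edu = np.array([["medium"], ["high"], ["high"], ["low"], ["missing"]])

# The position in this list defines the integer code of each category.
enc = OrdinalEncoder(categories=[["missing", "low", "medium", "high"]])
codes = enc.fit_transform(edu).ravel()
print(codes)  # [2. 3. 3. 1. 0.]
```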


## Numerical features

Just like categorical data can be encoded, numerical features can be converted into categorical features. The two most common ways to do this are **discretization** and **binarization**.

**Discretization**: divides a continuous feature into a pre-specified number of categories (bins). One of the main goals of discretization is to significantly reduce the number of distinct values of a continuous attribute, which is why this transformation can improve the performance of tree-based models.

Sklearn provides a `KBinsDiscretizer` class that can take care of this. The only things you have to specify are the number of bins (`n_bins`) for each feature and how to encode these bins (`ordinal`, `onehot` or `onehot-dense`).

The optional strategy parameter can be set to three values:

* uniform: all bins in each feature have identical widths, which makes it very sensitive to outliers
* quantile (default): all bins in each feature have the same number of points.
* kmeans: all values in each bin have the same nearest center of a 1D k-means cluster.

In [61]:
from sklearn.preprocessing import KBinsDiscretizer

x = [[-2, 1, -4,   -1],
     [-1, 2, -3, -0.5],
     [ 0, 3, -2,  0.5],
     [ 1, 4, -1,    2]]
est = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
est.fit_transform(x)

array([[0., 0., 0., 0.],
       [1., 1., 1., 0.],
       [2., 2., 2., 1.],
       [2., 2., 2., 2.]])
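
To see why the `uniform` strategy is sensitive to outliers, the sketch below bins a made-up feature with one extreme value using both `uniform` and `quantile`. With equal-width bins the outlier stretches the range, so all regular values collapse into the first bin.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical feature: five regular values and one outlier.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [100.0]])

binned = {}
for strategy in ("uniform", "quantile"):
    est = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    binned[strategy] = est.fit_transform(X).ravel()
    print(strategy, binned[strategy])

# uniform -> [0. 0. 0. 0. 0. 2.]: the middle bin stays empty
```

The quantile strategy spreads the samples over all three bins, at the cost of bins with very different widths.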

In [64]:
from sklearn.preprocessing import KBinsDiscretizer

print(data2.edu_lvl)
disc = KBinsDiscretizer(n_bins=3, encode='onehot', strategy='uniform')

disc.fit_transform(np.array(data2.edu_lvl).reshape(-1,1)).toarray()

0    2
1    3
2    3
3    1
4    0
Name: edu_lvl, dtype: int64


array([[0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 1., 0.],
       [1., 0., 0.]])

**Binarization**: Feature binarization is the process of thresholding numerical features to get boolean values. In other words, it assigns a boolean value (True or False) to each sample based on a threshold.

In general binarization is useful as a feature engineering technique for creating new features that indicate something meaningful. The `Binarizer` class in sklearn implements binarization in a very intuitive way. The only parameters you need to specify are the threshold and copy. All values below or equal to the threshold are replaced by 0, above it by 1. 

In [65]:
from sklearn.preprocessing import Binarizer

x = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]

# binary threshold 0
transformer = Binarizer(threshold = 0)
transformer.fit_transform(x)

array([[1., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.]])

## Feature scaling - Standardization

Before applying any scaling transformations, it is very important to **split your data into a train set and a test set**. If you scale before splitting, the scaling statistics (such as the mean, see below) are computed from the train and test data together, so they are not the statistics of the training data alone. Information from the test set then leaks into training, which defeats the purpose of holding out a test set in the first place.
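
A minimal sketch of that order of operations, using randomly generated data as a stand-in for a real dataset: the scaler is fitted on the training split only, and the same fitted statistics are reused for the test split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: 100 samples, 2 features).
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 2))
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics come from train only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics

print(X_train_scaled.mean(axis=0))  # close to 0 by construction
print(X_test_scaled.mean(axis=0))   # generally not exactly 0
```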

Standardization is a transformation that centers the data by removing the mean value of each feature and then scales it by dividing (non-constant) features by their standard deviation. After standardizing, each feature has a mean of zero and a standard deviation of one.

Standardization can drastically improve the performance of models. For instance, many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the l1 and l2 regularizers of linear models) assume that all features are centered around zero and have variance in the same order. **If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.**

**Standard Scaler**: It centers and scales the data using the following formula, where $\mu$ is the mean and $\sigma$ is the standard deviation. Remember that the missing values (all of row 1, and `f1` and `f3` in row 4) were replaced by the column mean, which is why they become 0 after scaling.

std_scaler = $\frac{x-\mu}{\sigma}$
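
The formula can be checked by hand on the imputed `f1` column. This sketch relies on the fact that `StandardScaler` uses the population standard deviation, which is also the NumPy default (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# The f1 column after mean imputation.
f1 = np.array([5.0, 333.0, -5.0, 999.0, 333.0])

# (x - mu) / sigma, with the population standard deviation.
manual = (f1 - f1.mean()) / f1.std()

# The same column through StandardScaler for comparison.
sk = StandardScaler().fit_transform(f1.reshape(-1, 1)).ravel()

print(manual)
print(np.allclose(manual, sk))  # True
```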

In [66]:
from sklearn.preprocessing import StandardScaler

print('Original Data')
print(data, '\n')

print('Filled Missing value')
print(data.fillna(data.mean()))

scaler = StandardScaler()
scaler.fit_transform(data.fillna(data.mean()))

Original Data
      f1   f2    f3
0    5.0  7.0   8.0
1    NaN  NaN   NaN
2   -5.0  0.0  25.0
3  999.0  1.0  -1.0
4    NaN  0.0   NaN 

Filled Missing value
      f1   f2         f3
0    5.0  7.0   8.000000
1  333.0  2.0  10.666667
2   -5.0  0.0  25.000000
3  999.0  1.0  -1.000000
4  333.0  0.0  10.666667


array([[-0.89913037,  1.91741247, -0.31933647],
       [ 0.        ,  0.        ,  0.        ],
       [-0.92654289, -0.76696499,  1.71643352],
       [ 1.82567326, -0.38348249, -1.39709705],
       [ 0.        , -0.76696499,  0.        ]])

**MinMax Scaler**: The MinMaxScaler transforms features by scaling each feature to a given range. This range can be set by specifying the feature_range parameter. **This scaler works better for cases where the distribution is not Gaussian or the standard deviation is very small**. However, it is **sensitive to outliers**, so if there are outliers in the data, you might want to consider another scaler.

MinMax_scaler = $\frac{x-min(x)}{max(x) - min(x)}$
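
The formula above maps each feature to [0, 1]. For another `feature_range`, the result is additionally stretched and shifted to the requested interval. A sketch on the imputed `f1` column, compared against `MinMaxScaler`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

f1 = np.array([5.0, 333.0, -5.0, 999.0, 333.0])
lo, hi = -10, 10  # the feature_range used in this notebook

# Step 1: scale to [0, 1]; step 2: stretch and shift to [lo, hi].
x_01 = (f1 - f1.min()) / (f1.max() - f1.min())
manual = x_01 * (hi - lo) + lo

sk = MinMaxScaler(feature_range=(lo, hi)).fit_transform(f1.reshape(-1, 1)).ravel()

print(manual)
print(np.allclose(manual, sk))  # True
```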

In [72]:
from sklearn.preprocessing import MinMaxScaler

print(data.fillna(data.mean()))
scaler = MinMaxScaler(feature_range=(-10,10))
scaler.fit_transform(data.fillna(data.mean()))

      f1   f2         f3
0    5.0  7.0   8.000000
1  333.0  2.0  10.666667
2   -5.0  0.0  25.000000
3  999.0  1.0  -1.000000
4  333.0  0.0  10.666667


array([[ -9.80079681,  10.        ,  -3.07692308],
       [ -3.26693227,  -4.28571429,  -1.02564103],
       [-10.        , -10.        ,  10.        ],
       [ 10.        ,  -7.14285714, -10.        ],
       [ -3.26693227, -10.        ,  -1.02564103]])

**MaxAbs Scaler**: The MaxAbsScaler works very similarly to the MinMaxScaler but automatically scales the data to a [-1, 1] range based on the **absolute maximum**. This scaler is meant for **data that is already centered at zero or sparse data**. It does not shift/center the data, and thus does not destroy any sparsity.

MaxAbs_scaler = $\frac{x}{\max|x|}$
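
Because the data is only divided and never shifted, zeros stay zeros. The sketch below checks this on a small, made-up SciPy sparse matrix.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical sparse input with four non-zero entries.
X = sparse.csr_matrix(np.array([[0.0, 2.0, 0.0],
                                [-4.0, 0.0, 0.0],
                                [2.0, 0.0, 5.0]]))

X_scaled = MaxAbsScaler().fit_transform(X)

print(sparse.issparse(X_scaled))  # the output is still sparse
print(X_scaled.nnz == X.nnz)      # no zero entry became non-zero
print(X_scaled.toarray())
```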

In [73]:
from sklearn.preprocessing import MaxAbsScaler

print(data.fillna(data.mean()))
scaler = MaxAbsScaler()
scaler.fit_transform(data.fillna(data.mean()))

      f1   f2         f3
0    5.0  7.0   8.000000
1  333.0  2.0  10.666667
2   -5.0  0.0  25.000000
3  999.0  1.0  -1.000000
4  333.0  0.0  10.666667


array([[ 0.00500501,  1.        ,  0.32      ],
       [ 0.33333333,  0.28571429,  0.42666667],
       [-0.00500501,  0.        ,  1.        ],
       [ 1.        ,  0.14285714, -0.04      ],
       [ 0.33333333,  0.        ,  0.42666667]])

**Robust Scaler**: If your data contains many outliers, scaling using the mean and standard deviation of the data is likely to not work very well. In these cases, you can use the `RobustScaler`. **It removes the median and scales the data according to the quantile range.**

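By default, the scaler uses the interquartile range (IQR), which is the range between the 1st quartile and the 3rd quartile. The quantile range can be set manually with the `quantile_range` parameter when creating a new instance of the `RobustScaler`. Its values are percentiles between 0 and 100 (the default is `(25.0, 75.0)`), so the `(0.1, 0.9)` used below spans only the 0.1th to 0.9th percentiles, which is why the scaled values come out so large. Pass `(10, 90)` to use the 10th to 90th percentiles instead.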

In [74]:
from sklearn.preprocessing import RobustScaler

print(data.fillna(data.mean()))
robust = RobustScaler(quantile_range = (0.1,0.9))
robust.fit_transform(data.fillna(data.mean()))

      f1   f2         f3
0    5.0  7.0   8.000000
1  333.0  2.0  10.666667
2   -5.0  0.0  25.000000
3  999.0  1.0  -1.000000
4  333.0  0.0  10.666667


array([[-1.02500000e+03,  6.00000000e+00, -9.25925926e+00],
       [ 0.00000000e+00,  1.00000000e+00,  0.00000000e+00],
       [-1.05625000e+03, -1.00000000e+00,  4.97685185e+01],
       [ 2.08125000e+03,  0.00000000e+00, -4.05092593e+01],
       [ 0.00000000e+00, -1.00000000e+00,  0.00000000e+00]])
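
For comparison, here is a sketch on the `f2` column with the default interquartile range and with a 10th to 90th percentile range. Note that `quantile_range` is given in percent, so `(10.0, 90.0)` is used here rather than `(0.1, 0.9)`.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# The f2 column after mean imputation.
f2 = np.array([[7.0], [2.0], [0.0], [1.0], [0.0]])

default = RobustScaler().fit(f2)                          # quantile_range=(25.0, 75.0)
wide = RobustScaler(quantile_range=(10.0, 90.0)).fit(f2)  # 10th to 90th percentile

print(default.center_, default.scale_)  # median and IQR
print(wide.center_, wide.scale_)        # same median, wider range
print(default.transform(f2).ravel())
```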

## Feature scaling - Normalization

Normalization is the process of **scaling individual samples to have unit norm**. In basic terms you need to normalize data when the algorithm predicts based on the weighted relationships formed between data points. Scaling inputs to unit norms is a common operation for text classification or clustering.

**One of the key differences between standardization and normalization is that normalization is a row-wise operation, while standardization is a column-wise operation.**

Although there are many other ways to normalize data, sklearn provides three norms: **L1, L2 and Max**. When creating a new instance of the Normalizer class you can specify the desired norm under the norm parameter.
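
A short sketch of the `Normalizer` class with each of the three norms, applied to two rows of the example data:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Two samples (rows) taken from the example frame.
X = np.array([[5.0, 7.0, 8.0],
              [-5.0, 0.0, 25.0]])

l1 = Normalizer(norm="l1").fit_transform(X)   # rows sum to 1 in absolute value
l2 = Normalizer(norm="l2").fit_transform(X)   # rows have Euclidean length 1
mx = Normalizer(norm="max").fit_transform(X)  # largest absolute entry per row is 1

print(l1)
print(l2)
print(mx)
```

The manual computations in the following cells reproduce the same row-wise scaling step by step.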

**Max**: The max norm uses the absolute maximum and does for samples what the MaxAbsScaler does for features.

Max_normalizer = $\frac{x}{\max_i |x_i|}$

In [107]:
x = data.fillna(data.mean())
print(x)

# largest absolute value in each row (sample)
norm_max = x.abs().max(axis=1)

# divide every row by its own max norm
max_norm_data = x.div(norm_max, axis=0)

max_norm_data

      f1   f2         f3
0    5.0  7.0   8.000000
1  333.0  2.0  10.666667
2   -5.0  0.0  25.000000
3  999.0  1.0  -1.000000
4  333.0  0.0  10.666667


      f1        f2        f3
0  0.625  0.875000  1.000000
1  1.000  0.006006  0.032032
2 -0.200  0.000000  1.000000
3  1.000  0.001001 -0.001001
4  1.000  0.000000  0.032032


**L1 Norm**: uses the sum of the absolute values of each sample as the norm. As a penalty term, the L1 norm gives equal penalty to all parameters, enforcing sparsity.

L1_norm = $\frac{x}{\sum_i |x_i|}$

In [112]:
# sum of absolute values in each row (sample)
norm_l1 = x.abs().sum(axis=1)

# divide every row by its own L1 norm
norm_l1_data = x.div(norm_l1, axis=0)

norm_l1_data

         f1        f2        f3
0  0.250000  0.350000  0.400000
1  0.963356  0.005786  0.030858
2 -0.166667  0.000000  0.833333
3  0.998002  0.000999 -0.000999
4  0.968962  0.000000  0.031038


**L2 Norm**: uses the square root of the sum of all the squared values of each sample as the norm. This creates smoothness and rotational invariance. Some models, like PCA, assume rotational invariance, and so L2 will perform better.

L2_norm = $\frac{x}{\sqrt{\sum{x_i^2}}}$

In [115]:
# Euclidean length of each row (sample)
norm_l2 = np.sqrt((x ** 2).sum(axis=1))

# divide every row by its own L2 norm
norm_l2_data = x.div(norm_l2, axis=0)

norm_l2_data

         f1        f2        f3
0  0.425628  0.595880  0.681005
1  0.999469  0.006003  0.032015
2 -0.196116  0.000000  0.980581
3  0.999999  0.001001 -0.001001
4  0.999487  0.000000  0.032016
