# Data Preprocessing with sklearn

This notebook summarizes the key data preprocessing techniques and the reasoning behind them. The sklearn preprocessing library forms a solid foundation for this important task in the data science pipeline.

## Outline

* Missing values
* Polynomial features
* Categorical features
* Numerical features
* Custom transformations
* Feature scaling
* Normalization

## Missing values

First, identify the missing values and decide what, if anything, should replace them. That decision depends partly on how random the missingness is.

If they are completely at random, they don’t give any extra information and can be omitted. On the other hand, if they’re not at random, the fact that a value is missing is itself information and can be expressed as an extra binary feature.
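
The idea of keeping missingness as an extra binary feature can be sketched as follows. This is only an illustration: the frame `df_demo` and the column name `income` are hypothetical, and the median fill is just one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; `income` is an illustrative column name.
df_demo = pd.DataFrame({"income": [52.0, np.nan, 61.0, np.nan]})

# Record missingness as a binary feature *before* imputing,
# so the information is not lost once the gaps are filled.
df_demo["income_missing"] = df_demo["income"].isna().astype(int)

# Fill the gaps afterwards (the median is an assumption, not a rule).
df_demo["income"] = df_demo["income"].fillna(df_demo["income"].median())
print(df_demo)
```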

In [1]:
import numpy as np
import pandas as pd

# Example Missing Data
data = np.array([5, 7, 8, np.nan, np.nan, np.nan, -5, 0, 25, 999, 1, -1, np.nan, 0, np.nan]).reshape((5, 3))
data = pd.DataFrame(data, columns = ['f1', 'f2', 'f3']) #feature 1, feature 2, feature 3
data

Unnamed: 0,f1,f2,f3
0,5.0,7.0,8.0
1,,,
2,-5.0,0.0,25.0
3,999.0,1.0,-1.0
4,,0.0,


Rows or columns with too many non-meaningful missing values can be deleted from your data with the `dropna` method. Because dropped rows leave gaps in the row labels, it is a good idea to call `reset_index` afterwards.

* axis: 0 for rows, 1 for columns
* thresh: the minimum number of non-NaN values a row or column needs in order to be kept
* inplace: if True, modify the frame in place instead of returning a new one

In [6]:
data.dropna()

Unnamed: 0,f1,f2,f3
0,5.0,7.0,8.0
2,-5.0,0.0,25.0
3,999.0,1.0,-1.0


In [8]:
data.dropna(thresh=1)

Unnamed: 0,f1,f2,f3
0,5.0,7.0,8.0
2,-5.0,0.0,25.0
3,999.0,1.0,-1.0
4,,0.0,


In [7]:
data.dropna().reset_index()

Unnamed: 0,index,f1,f2,f3
0,0,5.0,7.0,8.0
1,2,-5.0,0.0,25.0
2,3,999.0,1.0,-1.0
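
The cells above only drop rows. As a small sketch of the other options, the snippet below drops columns with `axis=1` and uses `reset_index(drop=True)` to avoid the extra `index` column. The variable names `frame`, `cols_kept` and `rows_kept` are illustrative; the values are the same as in the example frame above, rebuilt so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Same values as the example frame above.
frame = pd.DataFrame(
    [[5, 7, 8],
     [np.nan, np.nan, np.nan],
     [-5, 0, 25],
     [999, 1, -1],
     [np.nan, 0, np.nan]],
    columns=["f1", "f2", "f3"],
)

# Keep only columns with at least 4 non-NaN values:
# f2 has 4 of them, f1 and f3 have 3 each.
cols_kept = frame.dropna(axis=1, thresh=4)

# drop=True discards the old row labels instead of adding an `index` column.
rows_kept = frame.dropna().reset_index(drop=True)

print(cols_kept)
print(rows_kept)
```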


## Imputing values

To fill in missing values with common strategies, pandas provides the `replace` and `fillna` methods. The four main strategies are mean, most_frequent, median and constant (these are the names that sklearn's `SimpleImputer` uses for its `strategy` parameter).

pandas can also fill forward (ffill) or fill backward (bfill), which is convenient when working with time series.
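
A minimal sketch of forward and backward filling on a time series. The series `ts` and its dates are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with gaps.
ts = pd.Series(
    [1.0, np.nan, np.nan, 4.0, np.nan],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

forward = ts.ffill()   # carry the last observed value forward
backward = ts.bfill()  # pull the next observed value backward

print(forward)
print(backward)
```

Note that a backward fill leaves a trailing gap untouched, because there is no later observation to pull from.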

In [18]:
data.replace([999.0, 0], np.nan)

Unnamed: 0,f1,f2,f3
0,5.0,7.0,8.0
1,,,
2,-5.0,,25.0
3,,1.0,-1.0
4,,,


In [20]:
data.fillna(data.mean())

Unnamed: 0,f1,f2,f3
0,5.0,7.0,8.0
1,333.0,2.0,10.666667
2,-5.0,0.0,25.0
3,999.0,1.0,-1.0
4,333.0,0.0,10.666667
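
The same kind of fill can also be done on the sklearn side. The sketch below assumes `SimpleImputer` from `sklearn.impute` and picks the median strategy purely as an example; `frame` and `imputed` are illustrative names, and the values repeat the example frame so the snippet is self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Same values as the example frame above.
frame = pd.DataFrame(
    [[5, 7, 8],
     [np.nan, np.nan, np.nan],
     [-5, 0, 25],
     [999, 1, -1],
     [np.nan, 0, np.nan]],
    columns=["f1", "f2", "f3"],
)

# strategy can be "mean", "median", "most_frequent" or "constant".
imputer = SimpleImputer(strategy="median")

# fit_transform returns a NumPy array, so the column names are restored by hand.
imputed = pd.DataFrame(imputer.fit_transform(frame), columns=frame.columns)
print(imputed)
```

One reason to prefer the imputer over `fillna` is that the fitted statistics can be reused on new data with `imputer.transform`.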


Other popular ways to impute missing data are estimating the values from similar rows with the k-nearest neighbor (KNN) algorithm or interpolating the values using a wide range of interpolation methods.
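
Both approaches can be sketched in a few lines. This example assumes sklearn's `KNNImputer` for the neighbor-based fill and pandas' `interpolate` for the interpolation; the frame `toy` and its columns `a` and `b` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data: two features, one gap in `b`.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [10.0, np.nan, 30.0, 40.0],
})

# KNN: fill the gap with the mean of `b` over the 2 nearest rows,
# where distance is measured on the features that are present.
knn_filled = KNNImputer(n_neighbors=2).fit_transform(toy)

# Interpolation: treat `b` as an ordered sequence and fill linearly.
interp_filled = toy["b"].interpolate(method="linear")

print(knn_filled)
print(interp_filled)
```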

## Polynomial features

Creating polynomial features is a simple and common way of feature engineering that adds complexity to numeric input data by combining features. Polynomial features are often created when we want to include the notion that there exists a nonlinear relationship between the features and the target. They are mostly used to add complexity to linear models with few features, or when we suspect the effect of one feature is dependent on another feature.

For example, if you replace all missing values with 0, every cross-product involving a filled value will be 0 as well. Moreover, if you leave missing values (NaN) in place, creating polynomial features will raise a `ValueError`. Therefore, replacing missing values with the median or the mean seems to be a reasonable choice.
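
Both effects are easy to reproduce on a tiny array. The names `X_nan`, `X_zero` and `zero_filled` below are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_nan = np.array([[2.0, 3.0],
                  [np.nan, 4.0]])
poly = PolynomialFeatures(degree=2)

# With NaN in the input, sklearn refuses to build the features.
try:
    poly.fit_transform(X_nan)
    raised = False
except ValueError:
    raised = True
print("ValueError raised:", raised)

# Filling with 0 avoids the error, but every product involving
# the filled entry collapses to 0.
X_zero = np.nan_to_num(X_nan, nan=0.0)
zero_filled = poly.fit_transform(X_zero)
print(zero_filled)
```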

Sklearn provides a `PolynomialFeatures` class to create polynomial features from scratch. The `degree` parameter determines the maximum degree of the polynomial. For example, when `degree` is set to two and the inputs are $x_1$ and $x_2$, the features created will be 1, $x_1$, $x_2$, $x_1^2$, $x_1x_2$ and $x_2^2$. The `interaction_only` parameter tells the class to produce only the interaction features, i.e. 1, $x_1$, $x_2$ and $x_1x_2$.

The following example replaces 999.0 and `NaN` with the column mean, then creates interaction-only polynomial features up to the third degree.

In [31]:
from sklearn.preprocessing import PolynomialFeatures

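# Note: data.mean() is computed on the original frame, so the 999.0
# sentinel still contributes to the f1 mean (333.0)
data_clean = data.replace(999.0, np.nan).fillna(data.mean())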
print(data_clean)

poly = PolynomialFeatures(degree=3, interaction_only=True) # only create interaction terms

polynomials = pd.DataFrame(poly.fit_transform(data_clean))
polynomials

      f1   f2         f3
0    5.0  7.0   8.000000
1  333.0  2.0  10.666667
2   -5.0  0.0  25.000000
3  333.0  1.0  -1.000000
4  333.0  0.0  10.666667


Unnamed: 0,0,1,2,3,4,5,6,7
0,1.0,5.0,7.0,8.0,35.0,40.0,56.0,280.0
1,1.0,333.0,2.0,10.666667,666.0,3552.0,21.333333,7104.0
2,1.0,-5.0,0.0,25.0,-0.0,-125.0,0.0,-0.0
3,1.0,333.0,1.0,-1.0,333.0,-333.0,-1.0,-333.0
4,1.0,333.0,0.0,10.666667,0.0,3552.0,0.0,0.0


Just as with any other form of feature engineering, it is important to create polynomial features before doing any feature scaling.
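
One way to enforce that order is to chain the two steps in a pipeline. This is a sketch under assumptions: `StandardScaler` stands in for whichever scaler is used, and `X_raw` is made-up data.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Hypothetical raw data with two features on different scales.
X_raw = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 40.0]])

# Expand first, scale second: the scaler then standardizes
# the polynomial terms as well as the original features.
expand_then_scale = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
)
X_ready = expand_then_scale.fit_transform(X_raw)
print(X_ready.shape)
```

`include_bias=False` is used here so that no constant column is passed to the scaler.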

## Categorical features

Munging categorical data is another essential step in data preprocessing. Even for tree-based models, it is usually necessary to convert categorical features to a numerical representation.

Before you start transforming your data, it is important to figure out if the feature you’re working on is ordinal (as opposed to nominal). An ordinal feature is best described as a feature with natural, ordered categories where the distances between the categories are not known.

In sklearn, that means an `OrdinalEncoder` for ordinal data and a `OneHotEncoder` for nominal data.
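
As a rough sketch of the two encoders, the snippet below encodes one ordinal and one nominal column. The frame `people` is hypothetical, and the category order low < medium < high is an assumption made for the illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical sample without missing values.
people = pd.DataFrame({
    "blood_type": ["O-", "O-", "O+", "AB"],
    "edu_level": ["medium", "high", "high", "low"],
})

# Ordinal: pass the category order explicitly (assumed: low < medium < high).
ordinal = OrdinalEncoder(categories=[["low", "medium", "high"]])
edu_codes = ordinal.fit_transform(people[["edu_level"]])

# Nominal: one binary column per category, with no implied order.
# The default output is sparse, so it is converted for display.
onehot = OneHotEncoder()
blood_onehot = onehot.fit_transform(people[["blood_type"]]).toarray()

print(edu_codes.ravel())
print(onehot.categories_[0], blood_onehot, sep="\n")
```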

In [None]:


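# Example categorical data; NumPy stores the whole array as strings,
# so the missing value becomes the string 'nan'
data = pd.DataFrame(
    np.array(['M', 'O-', 'medium',
              'M', 'O-', 'high',
              'F', 'O+', 'high',
              'F', 'AB', 'low',
              'F', 'B+', np.nan])
      .reshape((5, 3)))
data.columns = ['sex', 'blood_type', 'edu_level']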