# Data Preprocessing with sklearn

This notebook summarizes the key data preprocessing techniques and the reasoning behind them. Sklearn's preprocessing library provides a solid foundation for this important task in the data science pipeline.

## Outline

* Missing values
* Polynomial features
* Categorical features
* Numerical features
* Custom transformations
* Feature scaling
* Normalization

## Missing values

First of all, it is important to identify the missing values and to decide which value, if any, should replace them. The short answer is that the decision should depend partly on how random the missing values are.

If they are completely at random, they don’t give any extra information and can be omitted. On the other hand, if they’re not at random, the fact that a value is missing is itself information and can be expressed as an extra binary feature.
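As a minimal sketch of the second case, the fact that a value is missing can be recorded as an extra binary column before the gaps are filled or dropped. The toy frame and the `_missing` suffix are illustrative assumptions, not part of this notebook's data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are for illustration only
df = pd.DataFrame({'f1': [5.0, np.nan, -5.0], 'f2': [7.0, np.nan, 0.0]})

# Add one 0/1 indicator column per feature: 1 where the value is missing
for col in ['f1', 'f2']:
    df[col + '_missing'] = df[col].isna().astype(int)

print(df)
```

scikit-learn's `SimpleImputer` has an `add_indicator` option that serves a similar purpose when imputing inside a pipeline.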

In [1]:
import numpy as np
import pandas as pd

# Example data with missing values
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
data = np.array([5, 7, 8, np.nan, np.nan, np.nan, -5, 0, 25, 999, 1, -1, np.nan, 0, np.nan]).reshape((5, 3))
data = pd.DataFrame(data, columns=['f1', 'f2', 'f3'])  # feature 1, feature 2, feature 3
data

Unnamed: 0,f1,f2,f3
0,5.0,7.0,8.0
1,,,
2,-5.0,0.0,25.0
3,999.0,1.0,-1.0
4,,0.0,


Rows or columns with too many non-meaningful missing values can be deleted from your data with the `dropna` method. However, the remaining rows keep their original index labels, which leaves gaps in the index, so calling `reset_index` afterwards is a good idea. The main parameters of `dropna` are:

* axis: 0 to drop rows, 1 to drop columns
* thresh: the minimum number of non-NaN values a row or column must have to be kept
* inplace: if True, modify the frame in place instead of returning a new one
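These parameters can be combined. The following sketch rebuilds the example frame so that it runs on its own, then shows column-wise dropping and a stricter `thresh`; the names `cols_kept` and `rows_kept` are illustrative.

```python
import numpy as np
import pandas as pd

# Same values as the example frame above, rebuilt so this cell runs on its own
data = pd.DataFrame(
    np.array([5, 7, 8, np.nan, np.nan, np.nan, -5, 0, 25,
              999, 1, -1, np.nan, 0, np.nan]).reshape((5, 3)),
    columns=['f1', 'f2', 'f3'],
)

# axis=1 drops columns: keep only columns with at least 4 non-NaN values
cols_kept = data.dropna(axis=1, thresh=4)

# axis=0 (the default) drops rows: keep only rows with at least 2 non-NaN values
rows_kept = data.dropna(thresh=2)

print(cols_kept.columns.tolist())  # ['f2']
print(rows_kept.index.tolist())    # [0, 2, 3]
```

Only `f2` has four non-missing values, so it is the only column that survives `thresh=4`; row 4 has a single non-missing value and is dropped by `thresh=2`.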

In [6]:
data.dropna()

Unnamed: 0,f1,f2,f3
0,5.0,7.0,8.0
2,-5.0,0.0,25.0
3,999.0,1.0,-1.0


In [8]:
data.dropna(thresh=1)

Unnamed: 0,f1,f2,f3
0,5.0,7.0,8.0
2,-5.0,0.0,25.0
3,999.0,1.0,-1.0
4,,0.0,


In [7]:
data.dropna().reset_index()

Unnamed: 0,index,f1,f2,f3
0,0,5.0,7.0,8.0
1,2,-5.0,0.0,25.0
2,3,999.0,1.0,-1.0
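
Note that `reset_index()` keeps the old row labels as a new `index` column, as the output above shows. If those labels are not needed, passing `drop=True` discards them. A small self-contained sketch, with the frame rebuilt from the example values and `clean` as an illustrative name:

```python
import numpy as np
import pandas as pd

# Same values as the example frame, rebuilt so this cell runs on its own
data = pd.DataFrame(
    np.array([5, 7, 8, np.nan, np.nan, np.nan, -5, 0, 25,
              999, 1, -1, np.nan, 0, np.nan]).reshape((5, 3)),
    columns=['f1', 'f2', 'f3'],
)

# drop=True discards the old labels instead of adding an 'index' column
clean = data.dropna().reset_index(drop=True)
print(clean)
```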
