# Data Preprocessing

In this notebook we will discuss the quality of the data that goes into machine learning algorithms, along with several techniques that help us build good machine learning models.

## Dealing with missing data
In this section we will try several practical techniques for dealing with missing values. 

### Identify missing values in tabular data

Let's create a small pandas DataFrame from CSV text in which two cells are missing:


In [1]:
import pandas as pd
from io import StringIO

csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''

df = pd.read_csv(StringIO(csv_data))
df

      A     B     C    D
0   1.0   2.0   3.0  4.0
1   5.0   6.0   NaN  8.0
2  10.0  11.0  12.0  NaN


In [2]:
df.isnull().sum()

A    0
B    0
C    1
D    1
dtype: int64

The _isnull_ method returns a DataFrame of booleans that marks each missing cell; summing it gives the number of missing values in each column.
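The same boolean mask can also be summed per row or over the whole frame. The snippet below is a minimal sketch that re-creates the example frame so it runs on its own; the names `mask`, `missing_per_row` and `total_missing` are illustrative and not part of the original notebook.

```python
import pandas as pd
from io import StringIO

csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''
df = pd.read_csv(StringIO(csv_data))

# Boolean mask: True wherever a value is missing
mask = df.isnull()

# Summing along axis=1 counts missing values per row instead of per column
missing_per_row = mask.sum(axis=1)

# Summing twice collapses the mask to the total number of missing cells
total_missing = int(mask.sum().sum())

print(missing_per_row)
print(total_missing)
```

Per-row counts are useful when deciding which examples to drop, whereas per-column counts point to the features that are poorly populated.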

Note that scikit-learn was originally developed for working with NumPy arrays rather than pandas DataFrames. The scikit-learn API is therefore more mature when working with plain NumPy arrays, even though it technically supports DataFrames as well. Note also that we can always retrieve the underlying NumPy array of a DataFrame, or of a single column, via the _values_ attribute.

In [3]:
df.values, df['A'].values

(array([[ 1.,  2.,  3.,  4.],
        [ 5.,  6., nan,  8.],
        [10., 11., 12., nan]]),
 array([ 1.,  5., 10.]))
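As a side note, the pandas documentation recommends the _to_numpy_ method over the _values_ attribute for new code. The sketch below assumes a pandas version that provides `to_numpy` (0.24 or later) and re-creates the example frame; `X`, `a` and `nan_per_column` are illustrative names.

```python
import numpy as np
import pandas as pd
from io import StringIO

csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''
df = pd.read_csv(StringIO(csv_data))

# Convert the whole frame and a single column to NumPy arrays
X = df.to_numpy()
a = df['A'].to_numpy()

# np.isnan is the NumPy counterpart of DataFrame.isnull for float arrays
nan_per_column = np.isnan(X).sum(axis=0)

print(X.shape)
print(nan_per_column)
```

Once the data is a plain array, missing values have to be located with NumPy functions such as `np.isnan`, since the pandas methods are no longer available.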

### Eliminating examples or features with missing values

_Rows_ with missing values can easily be dropped via the _dropna_ method. Note that axis = 0 is the default value:

In [4]:
df.dropna(axis = 0)

     A    B    C    D
0  1.0  2.0  3.0  4.0


In [5]:
df.dropna()

     A    B    C    D
0  1.0  2.0  3.0  4.0


Similarly, we can drop _features_ (columns) that contain at least one missing value by using _axis = 1_:

In [6]:
df.dropna(axis = 1)

      A     B
0   1.0   2.0
1   5.0   6.0
2  10.0  11.0


We can also set a _threshold_: with _thresh_, a row is kept only if it contains at least that many non-NaN values:

In [7]:
# Require at least 4 real values in order not to drop an example
df.dropna(thresh=4)

     A    B    C    D
0  1.0  2.0  3.0  4.0


In [8]:
# Only drop rows where all columns are NaN,
# which is none of the rows in our df.
df.dropna(how='all')

      A     B     C    D
0   1.0   2.0   3.0  4.0
1   5.0   6.0   NaN  8.0
2  10.0  11.0  12.0  NaN
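The _dropna_ method also accepts a _subset_ argument, which restricts the NaN check to specific columns. The sketch below re-creates the example frame and drops only the rows where column C is missing; the name `dropped` is illustrative.

```python
import pandas as pd
from io import StringIO

csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''
df = pd.read_csv(StringIO(csv_data))

# Only drop rows where NaN appears in column 'C';
# a NaN in any other column does not cause a row to be dropped
dropped = df.dropna(subset=['C'])

print(dropped)
```

This is handy when only some features are required to be complete, for example when the remaining gaps will be imputed later.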


### Imputing missing values

Dropping rows or columns is often not viable: it can discard too much valuable data and leave our ML algorithm with too few examples or features to learn robustly.

One of the most common techniques is to _interpolate_ using existing values. One such example would be the popular __mean imputation__, where we replace the missing value with the mean value of the entire feature column.
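To make the idea concrete, mean imputation can be sketched by hand with NumPy. The array `X` below is a hand-typed copy of the example data, and `col_means` and `X_imputed` are illustrative names that do not appear in the original notebook.

```python
import numpy as np

# Hand-typed copy of the example data, with NaN marking the missing cells
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [5.0, 6.0, np.nan, 8.0],
              [10.0, 11.0, 12.0, np.nan]])

# Column means computed while ignoring NaN entries
col_means = np.nanmean(X, axis=0)

# Replace each NaN with the mean of its column; col_means is
# broadcast across the rows so every cell sees its own column's mean
X_imputed = np.where(np.isnan(X), col_means, X)

print(X_imputed)
```

The missing value in column C becomes (3 + 12) / 2 = 7.5, and the one in column D becomes (4 + 8) / 2 = 6.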

scikit-learn has a class for handling this:

In [9]:
from sklearn.impute import SimpleImputer
import numpy as np

imr = SimpleImputer(missing_values = np.nan, strategy = 'mean')
imr = imr.fit(df.values)
imputed_data = imr.transform(df.values)
imputed_data

array([[ 1. ,  2. ,  3. ,  4. ],
       [ 5. ,  6. ,  7.5,  8. ],
       [10. , 11. , 12. ,  6. ]])
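The same result can be obtained directly in pandas with _fillna_. The following is a minimal sketch that re-creates the example frame; the name `filled` is illustrative.

```python
import pandas as pd
from io import StringIO

csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''
df = pd.read_csv(StringIO(csv_data))

# df.mean() returns one mean per column (NaN values are skipped by default);
# fillna then fills each column's gaps with that column's mean
filled = df.fillna(df.mean())

print(filled)
```

Unlike the _SimpleImputer_ output, the result stays a DataFrame and keeps its column labels. _SimpleImputer_ also accepts other values for _strategy_, such as 'median' and 'most_frequent', the latter being suitable for categorical features.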