# Data preprocessing

In this notebook, we discuss how to deal with missing values. 

## Imputation of Missing values

A dataset may contain missing values for some features, and many models cannot accept such a dataset as input. Common strategies for dealing with missing values include:

1. Drop the row. This reduces the amount of data available.
2. Replace the missing value with the mean, median, or most frequent value of the feature, or with a constant.
3. Predict the missing value from the other features. The rows in which the feature is present are used to estimate it for the rows in which it is missing.

In [1]:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

In [2]:
# Toy dataset; np.nan and None both mark missing entries.
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0.)
data = {'name': ['Michael', 'Jessica', 'Sue', 'Jake', 'Amy', 'Tye'],
        'gender': [np.nan, 'F', np.nan, 'F', np.nan, 'M'],
        'height': [123, 145, 100, np.nan, None, 150],
        'weight': [10, np.nan, 30, np.nan, None, 20],
        'age': [14, None, 29, np.nan, 52, 45],
        }
df = pd.DataFrame(data, columns=['name', 'gender', 'height', 'weight', 'age'])
df

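|   | name | gender | height | weight | age |
|---|---|---|---|---|---|
| 0 | Michael | NaN | 123.0 | 10.0 | 14.0 |
| 1 | Jessica | F | 145.0 | NaN | NaN |
| 2 | Sue | NaN | 100.0 | 30.0 | 29.0 |
| 3 | Jake | F | NaN | NaN | NaN |
| 4 | Amy | NaN | NaN | NaN | 52.0 |
| 5 | Tye | M | 150.0 | 20.0 | 45.0 |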


### Dropping rows with NaN/None values

In [3]:
# drop rows with NaN/None values

df_reduced = df.dropna()
df_reduced

|   | name | gender | height | weight | age |
|---|---|---|---|---|---|
| 5 | Tye | M | 150.0 | 20.0 | 45.0 |

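Before dropping rows, it can help to see how much data would be lost. The following is a minimal sketch, not part of the original notebook: it rebuilds the same toy DataFrame so that it runs on its own, and the names `missing_per_column` and `kept_fraction` are illustrative.

```python
import numpy as np
import pandas as pd

# Same toy data as in the notebook, rebuilt so the snippet is self-contained.
df = pd.DataFrame({
    'name': ['Michael', 'Jessica', 'Sue', 'Jake', 'Amy', 'Tye'],
    'gender': [np.nan, 'F', np.nan, 'F', np.nan, 'M'],
    'height': [123, 145, 100, np.nan, None, 150],
    'weight': [10, np.nan, 30, np.nan, None, 20],
    'age': [14, None, 29, np.nan, 52, 45],
})

# Number of missing entries in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Fraction of rows that survive a plain dropna().
kept_fraction = len(df.dropna()) / len(df)
print(f"rows kept: {kept_fraction:.0%}")
```

On this dataset only one of the six rows has no missing entry, which is why dropping rows is rarely a good first choice for small datasets.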

### Dropping rows that contain fewer than k non-missing values

In [4]:

# keep only the rows that have at least k = 3 non-missing values
df_reduced = df.dropna(thresh=3)
df_reduced

|   | name | gender | height | weight | age |
|---|---|---|---|---|---|
| 0 | Michael | NaN | 123.0 | 10.0 | 14.0 |
| 1 | Jessica | F | 145.0 | NaN | NaN |
| 2 | Sue | NaN | 100.0 | 30.0 | 29.0 |
| 5 | Tye | M | 150.0 | 20.0 | 45.0 |

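To make the meaning of `thresh` concrete, the sketch below counts the non-missing entries per row and filters on that count by hand. It is an illustration only: the DataFrame is rebuilt so the snippet is self-contained, and `non_missing` and `df_manual` are names introduced here.

```python
import numpy as np
import pandas as pd

# Same toy data as in the notebook, rebuilt so the snippet is self-contained.
df = pd.DataFrame({
    'name': ['Michael', 'Jessica', 'Sue', 'Jake', 'Amy', 'Tye'],
    'gender': [np.nan, 'F', np.nan, 'F', np.nan, 'M'],
    'height': [123, 145, 100, np.nan, None, 150],
    'weight': [10, np.nan, 30, np.nan, None, 20],
    'age': [14, None, 29, np.nan, 52, 45],
})

# Count the non-missing entries in every row (the name column counts too).
non_missing = df.notna().sum(axis=1)
print(non_missing.tolist())

# Keep rows with at least k non-missing entries; this mirrors dropna(thresh=k).
k = 3
df_manual = df[non_missing >= k]
print(df_manual.equals(df.dropna(thresh=k)))
```

Jake and Amy each have only two non-missing entries, so they are the rows removed for k = 3.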

### Imputing the missing value with the most frequent value of the column

In [5]:
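# 'most_frequent' works for both numeric and string columns.
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
# fit_transform returns a NumPy array, so the column names are replaced by 0..4.
df_imputed = pd.DataFrame(imputer.fit_transform(df))
df_imputed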

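|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | Michael | F | 123.0 | 10.0 | 14.0 |
| 1 | Jessica | F | 145.0 | 10.0 | 14.0 |
| 2 | Sue | F | 100.0 | 30.0 | 29.0 |
| 3 | Jake | F | 100.0 | 10.0 | 14.0 |
| 4 | Amy | F | 100.0 | 10.0 | 52.0 |
| 5 | Tye | M | 150.0 | 20.0 | 45.0 |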

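The mean and median strategies from the list at the top use the same `SimpleImputer` class, but they only apply to numeric data, so the string columns have to be left out. The following is a minimal sketch under that assumption; the DataFrame is rebuilt so the snippet runs on its own, and `numeric_cols` and `df_mean` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Same toy data as in the notebook, rebuilt so the snippet is self-contained.
df = pd.DataFrame({
    'name': ['Michael', 'Jessica', 'Sue', 'Jake', 'Amy', 'Tye'],
    'gender': [np.nan, 'F', np.nan, 'F', np.nan, 'M'],
    'height': [123, 145, 100, np.nan, None, 150],
    'weight': [10, np.nan, 30, np.nan, None, 20],
    'age': [14, None, 29, np.nan, 52, 45],
})

# Restrict the imputer to the numeric columns; 'mean' rejects string data.
numeric_cols = ['height', 'weight', 'age']
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Write the imputed values back into a copy so column names are preserved.
df_mean = df.copy()
df_mean[numeric_cols] = imputer.fit_transform(df[numeric_cols])
print(df_mean)
```

Passing `strategy='median'` instead swaps in the column median. The `gender` column is untouched here and still contains missing values.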

### Replacing the missing value with a constant value

In [6]:
# Fill every missing entry, in numeric and string columns alike, with -1.
imputer = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=-1)
df_imputed = pd.DataFrame(imputer.fit_transform(df))
df_imputed

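|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | Michael | -1 | 123.0 | 10.0 | 14.0 |
| 1 | Jessica | F | 145.0 | -1.0 | -1.0 |
| 2 | Sue | -1 | 100.0 | 30.0 | 29.0 |
| 3 | Jake | F | -1.0 | -1.0 | -1.0 |
| 4 | Amy | -1 | -1.0 | -1.0 | 52.0 |
| 5 | Tye | M | 150.0 | 20.0 | 45.0 |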
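
The third strategy from the list at the top, estimating a missing value from the other features, is not demonstrated in the cells above. One possible way to sketch it is scikit-learn's `KNNImputer`, which fills a missing entry with the average of that feature over the most similar rows. This is an assumed choice of technique, not necessarily the one the notebook had in mind; `n_neighbors=2` is arbitrary, and `numeric_cols` and `df_knn` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Same toy data as in the notebook, rebuilt so the snippet is self-contained.
df = pd.DataFrame({
    'name': ['Michael', 'Jessica', 'Sue', 'Jake', 'Amy', 'Tye'],
    'gender': [np.nan, 'F', np.nan, 'F', np.nan, 'M'],
    'height': [123, 145, 100, np.nan, None, 150],
    'weight': [10, np.nan, 30, np.nan, None, 20],
    'age': [14, None, 29, np.nan, 52, 45],
})

# KNNImputer works on numeric data only, so the string columns are left out.
numeric_cols = ['height', 'weight', 'age']

# Each missing entry is estimated from the 2 nearest rows that have that feature.
# n_neighbors=2 is an arbitrary choice for this tiny dataset.
imputer = KNNImputer(n_neighbors=2)

# Write the imputed values back into a copy so column names are preserved.
df_knn = df.copy()
df_knn[numeric_cols] = imputer.fit_transform(df[numeric_cols])
print(df_knn)
```

With only six rows the estimates are not meaningful; the point is the shape of the workflow. Observed values are left unchanged, and only the missing numeric entries are filled.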
