Reference: https://towardsdatascience.com/imputing-missing-values-using-the-simpleimputer-class-in-sklearn-99706afaff46

In [17]:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

One of the tasks that you need to perform prior to training your machine learning model is data preprocessing. Data cleansing is one key part of the data preprocessing task, and usually involves removing rows with empty values, or replacing them with some imputed values.

>To “**impute**” is to assign a value to something by inference from the value of the products or processes to which it contributes. In statistics, imputation is the process of replacing missing data with substituted values.

In this tutorial, we will see how to use the SimpleImputer class in sklearn to quickly and easily replace missing values in your Pandas dataframes.

In [29]:
# Load Sample Data

file_path = "C:/MyLearn/DataSet/Pandas/Tutorials/NanDataset.csv"
df = pd.read_csv(file_path)
df

Unnamed: 0,A,B,C,D
0,1,2.0,3.0,Good
1,4,,6.0,Good
2,7,,9.0,Excellent
3,10,11.0,12.0,
4,13,14.0,15.0,Excellent
5,16,17.0,,Fair
6,19,12.0,12.0,Excellent
7,20,11.0,23.0,Fair


The sample dataset has missing values in columns `B`, `C` and `D`. Columns `B` and `C` are numeric, and column `D` contains strings.
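A quick way to confirm where the gaps are is to count the missing values per column. The sketch below rebuilds the sample table inline (an assumption standing in for `NanDataset.csv`, so it runs without the file) and also shows what simply dropping incomplete rows would leave behind.

```python
import numpy as np
import pandas as pd

# Inline copy of the sample data shown above (assumed stand-in for the CSV file)
df = pd.DataFrame({
    "A": [1, 4, 7, 10, 13, 16, 19, 20],
    "B": [2.0, np.nan, np.nan, 11.0, 14.0, 17.0, 12.0, 11.0],
    "C": [3.0, 6.0, 9.0, 12.0, 15.0, np.nan, 12.0, 23.0],
    "D": ["Good", "Good", "Excellent", np.nan,
          "Excellent", "Fair", "Excellent", "Fair"],
})

# Count the missing values in each column
missing_counts = df.isna().sum()
print(missing_counts)

# Dropping every row that has at least one missing value
complete_rows = df.dropna()
print(len(complete_rows), "of", len(df), "rows have no missing values")
```

Dropping rows discards half of this small dataset, which is one reason to consider imputation instead.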

## Replacing Missing Values

All the missing values in the dataframe are represented using `NaN`. Usually, you can either drop them or replace them with some inferred values. For example, to fill the `NaN` values in column B with the mean of that column:

In [30]:
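# Fill the empty values in column B
# with the mean of that column.
# Assigning the result back avoids calling fillna(inplace=True)
# on a selected column, which recent pandas versions warn about.
df['B'] = df['B'].fillna(df['B'].mean())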
df

Unnamed: 0,A,B,C,D
0,1,2.0,3.0,Good
1,4,11.166667,6.0,Good
2,7,11.166667,9.0,Excellent
3,10,11.0,12.0,
4,13,14.0,15.0,Excellent
5,16,17.0,,Fair
6,19,12.0,12.0,Excellent
7,20,11.0,23.0,Fair


The empty values in column B are now filled with the mean of that column.

This is straightforward, but sometimes your fill strategy might be a little different. Instead of filling missing values with the mean of the column, you might want to fill them with the value that occurs most frequently.
A good example is column D, where the most frequently occurring value is “Excellent”.

To fill the missing value in column D with the most frequently occurring value, first check the value counts, then pass the top entry to fillna():

In [37]:
df['D'].value_counts()

Excellent    3
Good         2
Fair         2
Name: D, dtype: int64

In [38]:
# value_counts() sorts by frequency, so index[0] is the most common value
df['D'] = df['D'].fillna(df['D'].value_counts().index[0])
df

Unnamed: 0,A,B,C,D
0,1,2.0,3.0,Good
1,4,11.166667,6.0,Good
2,7,11.166667,9.0,Excellent
3,10,11.0,12.0,Excellent
4,13,14.0,15.0,Excellent
5,16,17.0,,Fair
6,19,12.0,12.0,Excellent
7,20,11.0,23.0,Fair


## Using sklearn’s SimpleImputer Class


An alternative to using the fillna() method is to use the SimpleImputer class from sklearn. You can find the SimpleImputer class in the sklearn.impute module. The easiest way to understand how to use it is through an example:

You first initialize an instance of the SimpleImputer class by indicating the strategy (mean) as well as specifying the missing values that you want to locate (np.nan):

```python
imputer = SimpleImputer(strategy='mean', missing_values=np.nan)
```
Once the instance is created, you use the fit() function to fit the imputer on the column(s) that you want to work on:

```python
imputer = imputer.fit(df[['B']])
```
You can now use the transform() function to fill the missing values based on the strategy you specified in the initializer of the SimpleImputer class:

```python
df['B'] = imputer.transform(df[['B']])
```

>Take note that both the fit() and transform() functions **expect a 2D array**, so be sure to pass in a 2D array or dataframe. If you pass in a 1D array or a Pandas Series, you will get an error.
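To make the 2D requirement concrete, here is a minimal sketch using only column B of the sample data (rebuilt inline as an assumption, so the snippet runs without the CSV). It shows a Series being rejected, and two ways of supplying a 2D input instead.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Column B of the sample data, rebuilt inline (assumed stand-in for the CSV)
df = pd.DataFrame({"B": [2.0, np.nan, np.nan, 11.0, 14.0, 17.0, 12.0, 11.0]})
imputer = SimpleImputer(strategy="mean", missing_values=np.nan)

# df["B"] is a 1D Series, so fit() rejects it with a ValueError
try:
    imputer.fit(df["B"])
    raised = False
except ValueError as err:
    raised = True
    print("1D input rejected:", type(err).__name__)

# Double brackets select a one-column dataframe, which is 2D
filled_from_frame = imputer.fit_transform(df[["B"]])

# A 1D NumPy array can be reshaped into a single column
column_2d = df["B"].to_numpy().reshape(-1, 1)
filled_from_array = imputer.fit_transform(column_2d)

print(filled_from_frame.shape, filled_from_array.shape)
```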

The transform() function returns the result as a 2D array. In our example, we assign the values back to column B.


In [46]:
df = pd.read_csv(file_path)

imputer = SimpleImputer(strategy='mean', missing_values=np.nan)
imputer = imputer.fit(df[['B']])
df['B'] = imputer.transform(df[['B']])
df

Unnamed: 0,A,B,C,D
0,1,2.0,3.0,Good
1,4,11.166667,6.0,Good
2,7,11.166667,9.0,Excellent
3,10,11.0,12.0,
4,13,14.0,15.0,Excellent
5,16,17.0,,Fair
6,19,12.0,12.0,Excellent
7,20,11.0,23.0,Fair
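One practical difference from fillna() is that a fitted imputer remembers the value it learned and can apply it to data it has not seen. The sketch below assumes an inline copy of column B and a small, purely hypothetical `new_rows` dataframe; it also uses fit_transform(), which combines the fit() and transform() calls.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Column B of the sample data, rebuilt inline (assumed stand-in for the CSV)
df = pd.DataFrame({"B": [2.0, np.nan, np.nan, 11.0, 14.0, 17.0, 12.0, 11.0]})

imputer = SimpleImputer(strategy="mean", missing_values=np.nan)

# fit_transform() is shorthand for fit() followed by transform()
df[["B"]] = imputer.fit_transform(df[["B"]])

# Hypothetical new data: the fitted imputer reuses the mean learned from df,
# it does not recompute a mean from new_rows
new_rows = pd.DataFrame({"B": [np.nan, 5.0]})
new_filled = imputer.transform(new_rows[["B"]])

print(imputer.statistics_)  # the learned fill value for column B
print(new_filled)
```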


## Replacing Multiple Columns

To replace the missing values for multiple columns in your dataframe, you just need to pass in a dataframe containing the relevant columns:

In [49]:
df = pd.read_csv(file_path)

imputer = SimpleImputer(strategy='mean', missing_values=np.nan)
imputer = imputer.fit(df[['B','C']])
df[['B','C']] = imputer.transform(df[['B','C']])
df

Unnamed: 0,A,B,C,D
0,1,2.0,3.0,Good
1,4,11.166667,6.0,Good
2,7,11.166667,9.0,Excellent
3,10,11.0,12.0,
4,13,14.0,15.0,Excellent
5,16,17.0,11.428571,Fair
6,19,12.0,12.0,Excellent
7,20,11.0,23.0,Fair


The above example replaces the missing values in columns B and C using the “mean” strategy. Each column is filled with its own mean.
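If you want to check which values the imputer computed, the fitted object exposes them through its `statistics_` attribute, one entry per column. This sketch, again on an inline copy of columns B and C (an assumption replacing the CSV), compares them with the column means that pandas reports.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Columns B and C of the sample data, rebuilt inline (assumed stand-in for the CSV)
df = pd.DataFrame({
    "B": [2.0, np.nan, np.nan, 11.0, 14.0, 17.0, 12.0, 11.0],
    "C": [3.0, 6.0, 9.0, 12.0, 15.0, np.nan, 12.0, 23.0],
})

imputer = SimpleImputer(strategy="mean", missing_values=np.nan)
imputer.fit(df[["B", "C"]])

# statistics_ holds one fill value per fitted column, in column order
learned_means = imputer.statistics_

# pandas skips NaN by default when computing the mean
pandas_means = df[["B", "C"]].mean().to_numpy()

print(learned_means)
print(pandas_means)
```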

## Replacing using the median

Instead of using the mean of each column to update the missing values, you can also use the median:

In [52]:
df = pd.read_csv(file_path)

imputer = SimpleImputer(strategy='median', missing_values=np.nan)
imputer = imputer.fit(df[['B','C']])
df[['B','C']] = imputer.transform(df[['B','C']])
df

Unnamed: 0,A,B,C,D
0,1,2.0,3.0,Good
1,4,11.5,6.0,Good
2,7,11.5,9.0,Excellent
3,10,11.0,12.0,
4,13,14.0,15.0,Excellent
5,16,17.0,12.0,Fair
6,19,12.0,12.0,Excellent
7,20,11.0,23.0,Fair


## Replacing with the most frequent value

If you want to replace missing values with the most frequently occurring value, use the “most_frequent” strategy:

In [53]:
df = pd.read_csv(file_path)

imputer = SimpleImputer(strategy='most_frequent', 
                        missing_values=np.nan)
imputer = imputer.fit(df[['D']])
df[['D']] = imputer.transform(df[['D']])
df

Unnamed: 0,A,B,C,D
0,1,2.0,3.0,Good
1,4,,6.0,Good
2,7,,9.0,Excellent
3,10,11.0,12.0,Excellent
4,13,14.0,15.0,Excellent
5,16,17.0,,Fair
6,19,12.0,12.0,Excellent
7,20,11.0,23.0,Fair


This strategy is useful for categorical columns (although it also works for numerical columns). In the result above, the missing value in column D has been replaced with “Excellent”.
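As a small illustration of the numerical case, the sketch below applies the same strategy to column B, using an inline copy of that column (an assumption replacing the CSV). The value 11.0 appears twice and every other value once, so it becomes the fill value.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Column B of the sample data, rebuilt inline (assumed stand-in for the CSV)
df = pd.DataFrame({"B": [2.0, np.nan, np.nan, 11.0, 14.0, 17.0, 12.0, 11.0]})

imputer = SimpleImputer(strategy="most_frequent", missing_values=np.nan)
df[["B"]] = imputer.fit_transform(df[["B"]])

print(imputer.statistics_)  # the most frequent value in column B
print(df)
```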

## Replacing with a fixed value

Another strategy you can use is replacing missing values with a fixed (constant) value. To do this, specify “constant” for strategy and provide the fill value using the fill_value parameter.

The code snippet below replaces all the missing values in columns B and C with 0:

In [54]:
df = pd.read_csv(file_path)

imputer = SimpleImputer(strategy='constant',
                        missing_values=np.nan, fill_value=0)
imputer = imputer.fit(df[['B','C']])
df[['B','C']] = imputer.transform(df[['B','C']])
df

Unnamed: 0,A,B,C,D
0,1,2.0,3.0,Good
1,4,0.0,6.0,Good
2,7,0.0,9.0,Excellent
3,10,11.0,12.0,
4,13,14.0,15.0,Excellent
5,16,17.0,0.0,Fair
6,19,12.0,12.0,Excellent
7,20,11.0,23.0,Fair
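The constant strategy is not limited to numbers. As a sketch, the snippet below fills the gap in the string column D with a placeholder label; the label `"Unknown"` is an arbitrary choice made for illustration, and the column is rebuilt inline as a stand-in for the CSV.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Column D of the sample data, rebuilt inline (assumed stand-in for the CSV);
# object dtype matches how read_csv typically loads a string column
df = pd.DataFrame({
    "D": pd.Series(["Good", "Good", "Excellent", np.nan,
                    "Excellent", "Fair", "Excellent", "Fair"], dtype=object),
})

# "Unknown" is a hypothetical placeholder label chosen for this example
imputer = SimpleImputer(strategy="constant",
                        missing_values=np.nan, fill_value="Unknown")
df[["D"]] = imputer.fit_transform(df[["D"]])

print(df)
```

Using an explicit label keeps the fact that a value was missing visible, instead of blending it into an existing category.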


## Applying the SimpleImputer to the Entire Dataframe

If you want to apply the same strategy to the entire dataframe, you can call the fit() and transform() functions with the dataframe. Because the transform() function returns the result as a 2D array, you can use the `iloc[]` indexer to write the values back into the dataframe.

In [58]:
df = pd.read_csv(file_path)

imputer = SimpleImputer(strategy='most_frequent', 
                        missing_values=np.nan)
imputer = imputer.fit(df)
df.iloc[:,:] = imputer.transform(df)
df

Unnamed: 0,A,B,C,D
0,1,2.0,3.0,Good
1,4,11.0,6.0,Good
2,7,11.0,9.0,Excellent
3,10,11.0,12.0,Excellent
4,13,14.0,15.0,Excellent
5,16,17.0,12.0,Fair
6,19,12.0,12.0,Excellent
7,20,11.0,23.0,Fair


Another technique is to create a new dataframe using the result returned by the transform() function:

In [27]:
# Wrap the returned 2D array in a new dataframe, reusing the column names
df = pd.DataFrame(imputer.transform(df), 
                  columns=df.columns)
df

Unnamed: 0,A,B,C,D
0,1,2.0,3.0,Good
1,4,11.0,6.0,Good
2,7,11.0,9.0,Excellent
3,10,11.0,12.0,Excellent
4,13,14.0,15.0,Excellent
5,16,17.0,12.0,Fair
6,19,12.0,12.0,Excellent
7,20,11.0,23.0,Fair


>In the above example, the “most_frequent” strategy is applied to the entire dataframe. If you use the median or mean strategies, you will get an error as column D is not a numerical column.
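One way around this limitation is to give numerical and categorical columns different strategies. The sketch below uses sklearn's ColumnTransformer for that; this is not part of the original walkthrough, and the transformer names and the inline copy of the sample data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Inline copy of the sample data (assumed stand-in for the CSV file)
df = pd.DataFrame({
    "A": [1, 4, 7, 10, 13, 16, 19, 20],
    "B": [2.0, np.nan, np.nan, 11.0, 14.0, 17.0, 12.0, 11.0],
    "C": [3.0, 6.0, 9.0, 12.0, 15.0, np.nan, 12.0, 23.0],
    "D": pd.Series(["Good", "Good", "Excellent", np.nan,
                    "Excellent", "Fair", "Excellent", "Fair"], dtype=object),
})

preprocessor = ColumnTransformer(
    transformers=[
        # Mean for the numerical columns
        ("numeric", SimpleImputer(strategy="mean"), ["B", "C"]),
        # Most frequent value for the string column
        ("categorical", SimpleImputer(strategy="most_frequent"), ["D"]),
    ],
    remainder="passthrough",  # keep column A unchanged
)

imputed = preprocessor.fit_transform(df)

# Output columns follow the transformer order (B, C, D),
# with the passthrough column A appended at the end
result = pd.DataFrame(imputed, columns=["B", "C", "D", "A"])
print(result)
```

Note that the column order of the result differs from the original dataframe, so the column names have to be listed in the new order.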

## Conclusion

In this tutorial, we discussed how to replace missing values in your dataframe using sklearn’s SimpleImputer class. While you can also replace missing values manually using the fillna() method, the SimpleImputer class makes it relatively easy to handle missing values. If you are working with sklearn, it is often easier to use SimpleImputer together with Pipeline objects.
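As a rough sketch of what that combination can look like, the snippet below chains a SimpleImputer with a model inside a Pipeline. The choice of LinearRegression and of predicting column A from columns B and C is purely hypothetical and only serves to give the pipeline something to fit; the data is an inline copy of the sample table.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Inline copy of the numerical sample columns (assumed stand-in for the CSV)
df = pd.DataFrame({
    "A": [1, 4, 7, 10, 13, 16, 19, 20],
    "B": [2.0, np.nan, np.nan, 11.0, 14.0, 17.0, 12.0, 11.0],
    "C": [3.0, 6.0, 9.0, 12.0, 15.0, np.nan, 12.0, 23.0],
})

X = df[["B", "C"]]   # features that contain missing values
y = df["A"]          # hypothetical target, chosen only for illustration

model = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),  # fill gaps first
    ("regressor", LinearRegression()),              # then fit the model
])

# fit() learns the medians and trains the regressor in one call;
# predict() applies the same medians before predicting
model.fit(X, y)
predictions = model.predict(X)

print(model.named_steps["imputer"].statistics_)
print(predictions)
```

Because the imputer lives inside the pipeline, the fill values are learned only from the data passed to fit() and are then reused whenever the pipeline makes predictions.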

***