# **Simple Imputer**

`SimpleImputer` is scikit-learn's basic univariate imputer: it fills the missing values in each column using either a statistic computed from that column or a user-supplied constant.

It supports four strategies: `mean`, `median`, `most_frequent`, and `constant`.

By default, missing values are those encoded as `np.nan`; a different marker can be set with the `missing_values` parameter.

It is a fundamental tool in data preprocessing for machine learning, as many algorithms cannot handle missing data directly.


In [18]:
import pandas as pd
import numpy as np

In [19]:
miles = pd.DataFrame({'farthest_run_mi': [50, 62, np.nan, 100, 26, 13, 31, 50]})
miles

   farthest_run_mi
0             50.0
1             62.0
2              NaN
3            100.0
4             26.0
5             13.0
6             31.0
7             50.0


In [20]:
miles.isna().sum()

farthest_run_mi    1
dtype: int64

In [21]:
from sklearn.impute import SimpleImputer

In [22]:
# strategy='mean' replaces each missing value with the mean of its column
imp_mean = SimpleImputer(strategy='mean')
imp_mean.fit_transform(miles)

array([[ 50.        ],
       [ 62.        ],
       [ 47.42857143],
       [100.        ],
       [ 26.        ],
       [ 13.        ],
       [ 31.        ],
       [ 50.        ]])
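
The imputer learns its fill value during `fit` and reuses it in `transform`, so the statistic computed on one dataset can be applied to another. The sketch below assumes a hypothetical second DataFrame, `new_runs`, which is not part of the original notebook; the learned value is read from the fitted `statistics_` attribute.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train = pd.DataFrame({'farthest_run_mi': [50, 62, np.nan, 100, 26, 13, 31, 50]})
# Hypothetical unseen data, used only for illustration
new_runs = pd.DataFrame({'farthest_run_mi': [np.nan, 42.0]})

imputer = SimpleImputer(strategy='mean')
imputer.fit(train)                      # learn the column mean from train only

learned_mean = imputer.statistics_[0]   # one learned value per column
filled = imputer.transform(new_runs)    # NaN in new_runs is filled with the train mean
```

Fitting on the training data alone and calling only `transform` on later data keeps information from the later data out of the imputation statistic.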

In [23]:
# strategy='median' replaces each missing value with the median of its column
imp_median = SimpleImputer(strategy='median')
imp_median.fit_transform(miles)

array([[ 50.],
       [ 62.],
       [ 50.],
       [100.],
       [ 26.],
       [ 13.],
       [ 31.],
       [ 50.]])

In [24]:
# strategy='most_frequent' replaces each missing value with the most frequent value (mode) of its column
imp_mode = SimpleImputer(strategy='most_frequent')
imp_mode.fit_transform(miles)

array([[ 50.],
       [ 62.],
       [ 50.],
       [100.],
       [ 26.],
       [ 13.],
       [ 31.],
       [ 50.]])
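
Unlike `mean` and `median`, the `most_frequent` strategy also works on string data. A minimal sketch, assuming a hypothetical `positions` DataFrame that does not appear in the original notebook:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical categorical column; dtype=object keeps the strings and NaN together
positions = pd.DataFrame(
    {'position': ['pitcher', 'catcher', np.nan, 'pitcher']}, dtype=object
)

imp_mode_category = SimpleImputer(strategy='most_frequent')
filled_positions = imp_mode_category.fit_transform(positions)
# The missing entry takes the most common value in the column, 'pitcher'
```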

In [25]:
# strategy='constant' replaces each missing value with fill_value (13 here).
# This is useful when a specific placeholder should stand in for missing data.
imp_constant = SimpleImputer(strategy='constant', fill_value=13)
imp_constant.fit_transform(miles)

array([[ 50.],
       [ 62.],
       [ 13.],
       [100.],
       [ 26.],
       [ 13.],
       [ 31.],
       [ 50.]])

In [26]:
names = pd.DataFrame({'name': ['ryan', 'nolan', 'honus', 'wagner', np.nan, 'ruth']})

In [27]:
imp_constant_category = SimpleImputer(strategy='constant', fill_value='unknown')
imp_constant_category.fit_transform(names)

array([['ryan'],
       ['nolan'],
       ['honus'],
       ['wagner'],
       ['unknown'],
       ['ruth']], dtype=object)

In [28]:
# add_indicator=True appends a binary column (1 = value was missing) for each column that contained missing values
imp_mean_marked = SimpleImputer(strategy='mean', add_indicator=True)
imp_mean_marked.fit_transform(miles)

array([[ 50.        ,   0.        ],
       [ 62.        ,   0.        ],
       [ 47.42857143,   1.        ],
       [100.        ,   0.        ],
       [ 26.        ,   0.        ],
       [ 13.        ,   0.        ],
       [ 31.        ,   0.        ],
       [ 50.        ,   0.        ]])

In [None]:
from sklearn.compose import make_column_transformer

# Example of using make_column_transformer to apply different imputers to different columns
column_transformer = make_column_transformer(
    (imp_constant_category, ['name']),
    (imp_mean, ['farthest_run_mi']),
    remainder='drop'  # Drop other columns not specified
)
# `imp_constant_category` handles the 'name' column, while `imp_mean` handles the 'farthest_run_mi' column.
# The `remainder='drop'` argument discards any column not listed in the transformer.

In [30]:
column_transformer.set_output(transform='pandas')  # Set output to pandas DataFrame
