# Impute Missing Values Using sklearn

Scikit-learn provides several methods for imputing missing values. One commonly used method is the `SimpleImputer` class.


In [2]:
# import libraries
import pandas as pd
import numpy as np
import seaborn as sns

# The SimpleImputer class from the sklearn.impute module in scikit-learn is used to impute missing values in datasets.
# The SimpleImputer class provides a simple strategy for imputing missing values, such as replacing them with the mean, median, most frequent value, or a constant.

from sklearn.impute import SimpleImputer

In [3]:
df = sns.load_dataset('titanic')
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [4]:
df.isnull().sum().sort_values(ascending=False)

deck           688
age            177
embarked         2
embark_town      2
sex              0
pclass           0
survived         0
fare             0
parch            0
sibsp            0
class            0
adult_male       0
who              0
alive            0
alone            0
dtype: int64

Impute the `age` column using `SimpleImputer` from sklearn with the `median` strategy.

In [5]:
# Impute the age column using SimpleImputer from sklearn with the median of the observed ages

# An instance is an object created from a particular class definition.
# Create an instance of SimpleImputer with strategy 'median'
imputer = SimpleImputer(strategy = 'median')

# Fit the imputer to the data and transform the 'age' column
df['age'] = imputer.fit_transform(df[['age']])


In [6]:
df.isnull().sum().sort_values(ascending=False)

deck           688
embarked         2
embark_town      2
age              0
survived         0
pclass           0
sex              0
fare             0
parch            0
sibsp            0
class            0
adult_male       0
who              0
alive            0
alone            0
dtype: int64

In [7]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


## Univariate Imputation
- Univariate imputation involves filling in `missing values` in a dataset using information from a `single variable`.
- It typically involves computing `summary statistics` (such as `mean, median, or mode`) from the `non-missing values` of the variable and using that value to `replace` the missing values.
- Common methods of univariate imputation include `mean imputation` (replacing missing values with the `mean` of the variable), `median` imputation, `mode` imputation, and `constant` imputation (replacing missing values with a predetermined constant).
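
The strategies listed above can be compared on a tiny example. This is a minimal sketch on made-up data (the array `X` and the fill value `-1.0` are assumptions chosen for illustration, not values from the Titanic dataset):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy column with one missing entry at row index 3
X = np.array([[1.0], [2.0], [2.0], [np.nan], [7.0]])

results = {}
for strategy in ["mean", "median", "most_frequent"]:
    imputer = SimpleImputer(strategy=strategy)
    # Record the value that replaced the missing entry
    results[strategy] = float(imputer.fit_transform(X)[3, 0])

# Constant imputation needs an explicit fill_value
constant_imputer = SimpleImputer(strategy="constant", fill_value=-1.0)
results["constant"] = float(constant_imputer.fit_transform(X)[3, 0])

print(results)
```

On this column the mean strategy fills in 3.0, while the median and most frequent strategies both fill in 2.0, which shows how a single large value (7.0) pulls the mean but not the median.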

## Multivariate Imputation
- Multivariate imputation involves `filling in missing` values in a dataset using information from `multiple variables`.
- Instead of using summary statistics from a single variable, multivariate imputation takes into account the `relationships between variables` to `estimate the missing values more accurately`.
- Common methods of multivariate imputation include `k-nearest` neighbors imputation (replacing missing values with the values of the nearest neighbors in the feature space), `regression` imputation (fitting a regression model to predict missing values based on other variables), and `iterative` imputation methods such as Multiple Imputation by Chained Equations `(MICE)`.
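
To illustrate why using the relationships between variables can help, here is a small sketch on made-up data in which the second column is exactly twice the first. The array `X` is hypothetical, and `IterativeImputer` is used with its default settings (apart from `random_state`) as one possible multivariate method:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer, SimpleImputer

# Hypothetical toy data: column 1 is exactly 2 * column 0, last value missing
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, 8.0],
    [5.0, 10.0],
    [6.0, np.nan],
])

# Univariate: ignores column 0 and uses the mean of the observed column 1 values
mean_filled = SimpleImputer(strategy="mean").fit_transform(X)[5, 1]

# Multivariate: models column 1 as a function of column 0
iter_filled = IterativeImputer(random_state=0).fit_transform(X)[5, 1]

print(f"mean imputation:      {mean_filled:.2f}")
print(f"iterative imputation: {iter_filled:.2f}")
```

Mean imputation fills the gap with the column average, whereas the iterative imputer should land close to 12, the value implied by the linear relationship between the two columns.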

In [14]:
# multivariate imputation
df = sns.load_dataset('titanic')

In [15]:
df.isnull().sum().sort_values(ascending=False)

deck           688
age            177
embarked         2
embark_town      2
sex              0
pclass           0
survived         0
fare             0
parch            0
sibsp            0
class            0
adult_male       0
who              0
alive            0
alone            0
dtype: int64

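## Multivariate imputation with IterativeImputer
- Importing `enable_iterative_imputer` from `sklearn.experimental` enables the experimental `IterativeImputer` class in scikit-learn.
- The `IterativeImputer` class is a scikit-learn `estimator` that imputes missing values iteratively, modeling each feature with missing values as a function of the other features.
- It is often more `accurate` than univariate imputation, because it uses information from the other columns.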

In [10]:
# Import necessary modules
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 -- must be imported first to enable the experimental IterativeImputer
from sklearn.impute import IterativeImputer  # Import the IterativeImputer class

# Create an instance of IterativeImputer
imputer = IterativeImputer(max_iter=10, n_nearest_features=2, random_state=0)
# - max_iter: Maximum number of iterations (imputation rounds) to perform.
# - n_nearest_features: Number of other features used to estimate each feature's missing values
#   (favoring highly correlated features; the selection is randomized, hence random_state).

# Pass several numeric columns so the imputer can model 'age' from related features.
# With only the 'age' column there are no other features to learn from, and
# IterativeImputer simply returns the initial (mean) imputation.
num_cols = ['age', 'fare', 'pclass', 'sibsp', 'parch']
imputed = imputer.fit_transform(df[num_cols])
# - df[num_cols]: Select the numeric columns as a DataFrame.
# - imputer.fit_transform(...): Fit the imputer and return a NumPy array with columns in the order of num_cols.
# - imputed[:, 0]: The imputed 'age' values, assigned back to the original DataFrame 'df'.
df['age'] = imputed[:, 0]

In [10]:
df.isnull().sum().sort_values(ascending=False)

deck           688
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
age              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

## Forward Fill and Backward Fill
`Forward Fill (ffill):`<br>
Forward fill replaces missing values with the `last known value` in the dataset. It propagates the last observed non-missing value forward along the specified axis.<br>
`Backward Fill (bfill):`<br>
Backward fill replaces missing values with the next known value in the dataset. It propagates the next observed non-missing value backward along the specified axis.
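
The behavior described above can be seen on a small made-up Series (the values are hypothetical and chosen so that the edge cases are visible):

```python
import numpy as np
import pandas as pd

# Hypothetical series with gaps at the start, in the middle, and at the end
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()        # leading NaN stays: there is no earlier value to copy
backward = s.bfill()       # trailing NaN stays: there is no later value to copy
both = s.ffill().bfill()   # chaining the two fills every gap in this series

print(pd.DataFrame({"original": s, "ffill": forward, "bfill": backward, "both": both}))
```

Note that neither method alone is guaranteed to fill every gap. Both also depend on row order, so they are most meaningful for ordered data such as time series.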

In [17]:
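# using ffill to impute values
# (if 'age' was already imputed by an earlier cell, reload the dataset first; otherwise there is nothing left to fill)
df['age'] = df['age'].ffill()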

In [18]:
df.isnull().sum().sort_values(ascending=False)

deck           688
embarked         2
embark_town      2
age              0
survived         0
pclass           0
sex              0
fare             0
parch            0
sibsp            0
class            0
adult_male       0
who              0
alive            0
alone            0
dtype: int64

In [19]:
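# using bfill to impute values
# (run this on freshly loaded data to see its effect; after the ffill above no missing ages remain)
df['age'] = df['age'].bfill()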

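## KNN Imputer

The KNN Imputer, or k-Nearest Neighbors Imputer, fills in missing values in a dataset based on the values of its nearest neighbors: for each missing entry, it finds the k rows that are most similar on the other features and uses the average of their values for that feature.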
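
A small sketch on a made-up array shows the idea before it is applied to the Titanic data. The array `X` and the choice of `n_neighbors=2` are assumptions for illustration:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy data with two missing entries
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing value becomes the mean of that column over the 2 nearest rows,
# where distance is computed only on the features both rows have observed
imputer = KNNImputer(n_neighbors=2)
filled = imputer.fit_transform(X)

print(filled)
# The missing entry in row 0 becomes 4.0 (mean of 3.0 and 5.0)
# The missing entry in row 2 becomes 5.5 (mean of 3.0 and 8.0)
```

Because similarity is measured on the other columns, the imputer needs more than one feature to find meaningful neighbors.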

In [20]:
# Import the KNNImputer class from the sklearn.impute module
from sklearn.impute import KNNImputer

# Create an instance of KNNImputer
# 'n_neighbors' specifies the number of neighbors to use for imputing missing values
# Here, we use 5 neighbors to determine the value of missing entries
imputer = KNNImputer(n_neighbors=5)

# KNN needs other features to measure how similar two rows are, so pass several numeric columns.
# With only 'age', rows with a missing age have nothing to compare on, and KNNImputer falls back to the column mean.
# Note: if 'age' was already filled by an earlier cell, reload the dataset first.
num_cols = ['age', 'fare', 'pclass', 'sibsp', 'parch']

# The 'fit_transform()' method performs two steps:
# 1. Fit: Stores the training rows (no model is trained)
# 2. Transform: For each missing value, finds the k nearest rows (NaN-aware Euclidean distance)
#    and imputes the average of their values for that feature
# The result is a 2D NumPy array with columns in the order of num_cols
# Assign column 0 (the imputed 'age') back to the 'age' column in the DataFrame
df['age'] = imputer.fit_transform(df[num_cols])[:, 0]

In [21]:
df.isnull().sum().sort_values(ascending=False)

deck           688
embarked         2
embark_town      2
age              0
survived         0
pclass           0
sex              0
fare             0
parch            0
sibsp            0
class            0
adult_male       0
who              0
alive            0
alone            0
dtype: int64

## Drop rows having missing values

In [22]:
df.dropna(inplace=True) # drops rows having missing values

In [19]:
df.isnull().sum().sort_values(ascending=False)

survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
deck           0
embark_town    0
alive          0
alone          0
dtype: int64

In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 201 entries, 1 to 889
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     201 non-null    int64   
 1   pclass       201 non-null    int64   
 2   sex          201 non-null    object  
 3   age          201 non-null    float64 
 4   sibsp        201 non-null    int64   
 5   parch        201 non-null    int64   
 6   fare         201 non-null    float64 
 7   embarked     201 non-null    object  
 8   class        201 non-null    category
 9   who          201 non-null    object  
 10  adult_male   201 non-null    bool    
 11  deck         201 non-null    category
 12  embark_town  201 non-null    object  
 13  alive        201 non-null    object  
 14  alone        201 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 20.1+ KB
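
The `df.info()` output above shows that only 201 rows remain, mostly because `deck` is missing in 688 rows. As a hedged sketch of some gentler alternatives, the example below uses a small made-up DataFrame (the `toy` data is hypothetical) to compare dropping all incomplete rows with restricting the check to selected columns or removing a sparse column first:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'deck' is mostly missing, 'age' has one gap
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "deck": [np.nan, "C", np.nan, "C"],
    "fare": [7.25, 71.28, 7.93, 53.10],
})

# Drop every row that has at least one missing value
all_complete = toy.dropna()

# Only require 'age' to be present; gaps in other columns are tolerated
age_known = toy.dropna(subset=["age"])

# Remove the sparse column first, then drop the remaining incomplete rows
no_deck = toy.drop(columns=["deck"]).dropna()

print(len(all_complete), len(age_known), len(no_deck))
```

Which option is appropriate depends on whether the sparse column is needed for the analysis.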
