# Imputing Missing Values Using Mean, Median and Mode

Mean/median imputation replaces each missing value with the mean or median of the observed values in that column. It is simple and often works well, but it has limitations: it reduces the variance of the column, and it can bias estimates when the values are not missing completely at random.

1. **Simple Imputation Techniques:**
   - **Mean/Median Imputation:** Replace missing values with the mean or median of the column. Suitable for numerical data.
   - **Mode Imputation:** Replace missing values with the mode (most frequent value) of the column. Useful for categorical data.
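The variance reduction mentioned above can be seen on a tiny example. The sketch below uses a hypothetical toy column (not the Titanic data) and assumes only `pandas` and `numpy`; the name `ages` is illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column (not the Titanic data): six values, two missing.
ages = pd.Series([20.0, 30.0, np.nan, 40.0, np.nan, 50.0])

# Fill the gaps with the mean of the observed values.
filled = ages.fillna(ages.mean())

# The mean is unchanged, but the variance shrinks, because every
# imputed value sits exactly at the centre of the observed values.
print("mean before:", ages.mean(), "after:", filled.mean())
print("variance before:", ages.var(), "after:", filled.var())
```

The more values are imputed this way, the more the column is compressed towards its mean, so the shrinkage grows with the share of missing data.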

In [1]:
# import the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Load the Titanic dataset
data = sns.load_dataset('titanic')
data.head()

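| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |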


In [3]:
# Check the number of missing values in each column
data.isnull().sum().sort_values(ascending=False)

deck           688
age            177
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

We can see that the `age` column has 177 missing values. Let's replace these missing values with the mean of the column:

## Mean Imputation

In [4]:
# Impute missing values with mean
data['age'] = data['age'].fillna(data['age'].mean())

# Check the number of missing values in each column
data.isnull().sum().sort_values(ascending=False)

deck           688
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
age              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

We can see that the missing values in the `age` column have been replaced with the mean of the column.
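A null count of zero shows that the gaps are gone, but not what they were filled with. As a minimal sketch, assuming a hypothetical five-row frame named `toy` in place of the Titanic data, the mask of originally missing rows can be saved before filling and used to inspect the result:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the `age` column, small enough to inspect by eye.
toy = pd.DataFrame({"age": [22.0, np.nan, 26.0, 35.0, np.nan]})

# Remember which rows were missing before filling them.
was_missing = toy["age"].isna()
mean_age = toy["age"].mean()  # NaN values are skipped by default

toy["age"] = toy["age"].fillna(mean_age)

# Only the previously missing rows should now hold the mean;
# the observed values are left untouched.
print(toy.assign(was_missing=was_missing))
```

Keeping the mask around is a design choice rather than a requirement: it makes it possible to tell imputed values from observed ones later in the analysis.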

## Median Imputation

Let's reload the dataset, so that the `age` column has its missing values again, and replace them with the median of the column:

In [5]:
# Reload a fresh copy of the dataset (age was already imputed in `data`)
df = sns.load_dataset('titanic')
df.isnull().sum().sort_values(ascending=False)

deck           688
age            177
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

In [6]:
# Impute missing values with median
df['age'] = df['age'].fillna(df['age'].median())

# Check the number of missing values in each column
df.isnull().sum().sort_values(ascending=False)

deck           688
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
age              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64
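One common reason to choose the median over the mean is skew: a few extreme values pull the mean, and therefore the fill value, away from the bulk of the data. The sketch below illustrates this on a hypothetical column named `fares` with one outlier; the numbers are made up for illustration and are not taken from the Titanic dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed column: one extreme value and one missing value.
fares = pd.Series([7.0, 8.0, 8.0, 10.0, 500.0, np.nan])

mean_fill = fares.fillna(fares.mean())
median_fill = fares.fillna(fares.median())

# The mean is dragged upwards by the single outlier,
# while the median stays close to the typical values.
print("mean used as fill value:  ", fares.mean())
print("median used as fill value:", fares.median())
```

For roughly symmetric columns the two fill values are close, so the choice matters mainly when the distribution is skewed or contains outliers.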

## Mode Imputation

Mode imputation replaces missing values with the mode (most frequent value) of the column. This is useful for imputing categorical columns, such as `embarked` and `embark_town` in the Titanic dataset.

Let's see how to implement mode imputation in Python using the Titanic dataset.

In [7]:
# Load the dataset
df = sns.load_dataset('titanic')

# Check the number of missing values in each column
df.isnull().sum().sort_values(ascending=False)

deck           688
age            177
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

In [8]:
# Impute missing values with mode
df['embark_town'] = df['embark_town'].fillna(df['embark_town'].mode()[0])
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])

# Check the number of missing values in each column
df.isnull().sum().sort_values(ascending=False)

deck           688
age            177
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
embarked         0
class            0
who              0
adult_male       0
embark_town      0
alive            0
alone            0
dtype: int64

We can see that the missing values in the `embark_town` and `embarked` columns have been replaced with the mode of each column.
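The `[0]` after `mode()` is needed because `Series.mode()` returns a Series rather than a single value: several values can be tied for most frequent. A minimal sketch of that behaviour, using a hypothetical column named `ports` with a deliberate tie:

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column in which "C" and "S" are tied.
ports = pd.Series(["S", "C", "S", "C", "Q", np.nan])

# mode() ignores NaN by default and returns every tied value.
modes = ports.mode()
print(modes.tolist())

# Taking [0] picks the first of the tied values.
filled = ports.fillna(modes[0])
print(filled.tolist())
```

When there is a tie, it is worth checking the full result of `mode()` before relying on `[0]`, since the first entry is then just one of several equally frequent values.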