## Why must we handle missing values?

   * Missing data can introduce bias and distort patterns in the data.
   * Missing data can reduce the precision of a statistical analysis.
   * Many machine learning algorithms fail if the dataset contains missing values. Some algorithms can tolerate them in principle (K-nearest neighbors and Naive Bayes are often cited), but whether they do in practice depends on the implementation.
   * If the missing values are not handled properly, you may end up building a biased machine learning model that gives incorrect results.
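As a quick illustration of the third point, the sketch below fits scikit-learn's `LinearRegression` on a tiny made-up array that contains a single `NaN`. The array and variable names are illustrative assumptions, not part of the notebook; the point is only that the fit is rejected rather than silently skipping the gap.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical feature matrix with one missing entry
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [4.0, 5.0]])
y = np.array([1.0, 2.0, 3.0])

try:
    LinearRegression().fit(X, y)
    fit_failed = False
except ValueError as err:
    # scikit-learn validates its input and refuses NaN for this estimator
    fit_failed = True
    print(f'Fit failed: {err}')
```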

## How can missing data be handled?

  * Removing entire rows with any missing values. _This can lead to loss of information and potential bias if the missing data is not random._
  
  
  * Imputation:
  
      1- Simple Imputation: Filling missing values with the mean, median, or mode of the available data. Simple, but it can distort the distribution.
    
      2- Iterative Imputation: Using regression models to predict missing values based on other variables' values.
    
      3- K-Nearest Neighbors (KNN) Imputation: Estimating missing values by averaging values of the nearest neighbors in the dataset.
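Row removal, the first option in the list, can be sketched with pandas before moving on to imputation. The block below is a minimal example that assumes the same small frame the notebook builds in its next cell; the names `df_complete_rows` and `df_mostly_complete` are illustrative.

```python
import numpy as np
import pandas as pd

# Same toy frame as the one used in the cells that follow
df = pd.DataFrame({
    'Column1': [1, 2, np.nan, 4, 5],
    'Column2': [np.nan, 12, 13, np.nan, 15],
    'Column3': [21, 22, 23, np.nan, np.nan],
})

# Count missing values per column before deciding what to do
missing_per_column = df.isna().sum()

# Drop every row that has at least one missing value
df_complete_rows = df.dropna()

# Less aggressive: keep rows that have at least 2 observed values
df_mostly_complete = df.dropna(thresh=2)

print(missing_per_column)
print(f'Rows kept by dropna(): {len(df_complete_rows)} of {len(df)}')
print(f'Rows kept by dropna(thresh=2): {len(df_mostly_complete)} of {len(df)}')
```

On this frame only one of the five rows is fully observed, so plain row deletion discards most of the data.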


In [138]:
import pandas as pd
import numpy as np

data = {'Column1': [1, 2, np.nan, 4, 5],
        'Column2': [np.nan, 12, 13, np.nan, 15],
        'Column3': [21, 22, 23, np.nan, np.nan]}

df = pd.DataFrame(data)

df

Unnamed: 0,Column1,Column2,Column3
0,1.0,,21.0
1,2.0,12.0,22.0
2,,13.0,23.0
3,4.0,,
4,5.0,15.0,


### Simple Imputation

In [140]:
from sklearn.impute import SimpleImputer
import matplotlib.pyplot as plt

# Create an instance of SimpleImputer with the mean strategy
mean_imputer = SimpleImputer(strategy='mean')

# Fill missing values using the mean strategy
df_mean_imputed = pd.DataFrame(mean_imputer.fit_transform(df), columns=df.columns)

df_mean_imputed.head()

Unnamed: 0,Column1,Column2,Column3
0,1.0,13.333333,21.0
1,2.0,12.0,22.0
2,3.0,13.0,23.0
3,4.0,13.333333,22.0
4,5.0,15.0,22.0


In [141]:
# Create an instance of SimpleImputer with the median strategy
median_imputer = SimpleImputer(strategy='median')

# Fill missing values using the median strategy
df_median_imputed = pd.DataFrame(median_imputer.fit_transform(df), columns=df.columns)

df_median_imputed.head()

Unnamed: 0,Column1,Column2,Column3
0,1.0,13.0,21.0
1,2.0,12.0,22.0
2,3.0,13.0,23.0
3,4.0,13.0,22.0
4,5.0,15.0,22.0
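Mode imputation, the third simple strategy, is not shown in the cells above. A minimal sketch, assuming a small hypothetical categorical frame (`df_cat` and its values are made up for illustration), uses the `most_frequent` strategy of `SimpleImputer`:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical categorical data (not from the notebook above)
df_cat = pd.DataFrame(
    {'Color': ['red', 'blue', np.nan, 'red', np.nan],
     'Size': ['S', np.nan, 'M', 'M', 'M']},
    dtype=object,
)

# 'most_frequent' fills each column with its mode and also works on strings
mode_imputer = SimpleImputer(strategy='most_frequent')

df_mode_imputed = pd.DataFrame(mode_imputer.fit_transform(df_cat), columns=df_cat.columns)

print(df_mode_imputed)
```

Every gap in `Color` is filled with `red` and every gap in `Size` with `M`, the most common value in each column.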


### Iterative Imputation

In [142]:
# IterativeImputer is experimental; this import must come first to enable it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Initialize the IterativeImputer
imputer = IterativeImputer()

# Fit and transform the data using IterativeImputer
Iterative_imputer = imputer.fit_transform(df)

# Create a new DataFrame with imputed values
df_Iterative_imputer = pd.DataFrame(Iterative_imputer, columns=df.columns)

df_Iterative_imputer.head()

Unnamed: 0,Column1,Column2,Column3
0,1.0,12.964993,21.0
1,2.0,12.0,22.0
2,5.618548,13.0,23.0
3,4.0,13.314871,22.108407
4,5.0,15.0,21.563525
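To make the idea behind iterative imputation more concrete, the sketch below performs a single regression-based fill by hand. It is a deliberate simplification under assumed data (`x` and `y` are made up, with `y` exactly twice `x`): `IterativeImputer` uses a `BayesianRidge` estimator by default and cycles over all features for several rounds, whereas this example fits one ordinary linear regression once.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical two-feature data where y is exactly 2 * x, with one y value missing
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, np.nan, 8.0, 10.0])

observed = ~np.isnan(y)

# Fit a regression of y on x using only the rows where y is observed
model = LinearRegression().fit(x[observed].reshape(-1, 1), y[observed])

# Predict the missing entries from x and write them back
y_imputed = y.copy()
y_imputed[~observed] = model.predict(x[~observed].reshape(-1, 1))

print(y_imputed)
```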


### KNN Imputation

In [143]:
from sklearn.impute import KNNImputer

# Initialize the KNNImputer
imputer = KNNImputer(n_neighbors=2)

# Fit and transform the data using KNNImputer
knn_imputed = imputer.fit_transform(df)

# Create a new DataFrame with imputed values
df_knn_imputed = pd.DataFrame(knn_imputed, columns=df.columns)

df_knn_imputed.head()

Unnamed: 0,Column1,Column2,Column3
0,1.0,12.5,21.0
1,2.0,12.0,22.0
2,1.5,13.0,23.0
3,4.0,13.5,21.5
4,5.0,15.0,22.5
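`KNNImputer` has to find neighbors even though the rows themselves contain gaps. By default it does so with a NaN-aware Euclidean distance, which scikit-learn also exposes as `nan_euclidean_distances`. The sketch below, which copies the toy values into a plain array named `X` for illustration, prints the distances from row 2 to every other row.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

# Same values as the toy frame used above
X = np.array([
    [1, np.nan, 21],
    [2, 12, 22],
    [np.nan, 13, 23],
    [4, np.nan, np.nan],
    [5, 15, np.nan],
], dtype=float)

# Pairwise distances that skip missing coordinates and rescale by the
# share of coordinates that could actually be compared
distances = nan_euclidean_distances(X)

# Distances from row 2 (the row with Column1 missing) to every row
print(distances[2])
```

Row 1 is the closest row to row 2. Row 3 shares no observed column with row 2, so the distance between them is undefined (`NaN`).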


In [144]:
df

Unnamed: 0,Column1,Column2,Column3
0,1.0,,21.0
1,2.0,12.0,22.0
2,,13.0,23.0
3,4.0,,
4,5.0,15.0,


## How do you choose the imputation technique?

   * SimpleImputer is best when only a small number of observations are missing and missingness in one feature does not depend on other features.
   
        * Mean imputation is often used when the missing values are numerical and the distribution of the variable is approximately normal.
          
        * Median imputation is preferred when the distribution is skewed, as the median is less sensitive to outliers than the mean.
          
        * Mode imputation is suitable for categorical variables or numerical variables with a small number of unique values.
        

   * KNN imputation and iterative imputation are both reasonable choices when features are related to one another. Which one works better depends on the characteristics of the data and the goals of the analysis.
   
       * KNNImputer fills each missing value from the K most similar rows. It is often suggested for datasets with many missing values and few features, and it tends to work well when similar rows have similar values.
         
       * IterativeImputer models each feature that has missing values as a function of the other features and predicts the gaps from that model. It tends to work well when the features are related in more complex ways.
       
         
   * In general, try both methods and compare them on your own data, for example by hiding some known values and measuring how well each method recovers them.
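One way to run such a comparison is to start from complete data, hide some entries on purpose, and score each imputer on the hidden entries. The sketch below does this on synthetic data; the data-generating process, the masking scheme (at most one hidden value per row), and the RMSE metric are all assumptions chosen for illustration, so the ranking it prints says nothing about any other dataset.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

rng = np.random.default_rng(0)

# Hypothetical complete data: three strongly correlated numeric features
n_rows = 200
base = rng.normal(size=n_rows)
X_true = np.column_stack([
    base,
    2 * base + rng.normal(scale=0.1, size=n_rows),
    -base + rng.normal(scale=0.1, size=n_rows),
])

# Hide one value in each of 60 randomly chosen rows
hidden_rows = rng.choice(n_rows, size=60, replace=False)
hidden_cols = rng.integers(0, X_true.shape[1], size=60)
mask = np.zeros(X_true.shape, dtype=bool)
mask[hidden_rows, hidden_cols] = True

X_missing = X_true.copy()
X_missing[mask] = np.nan

imputers = {
    'mean': SimpleImputer(strategy='mean'),
    'knn': KNNImputer(n_neighbors=5),
    'iterative': IterativeImputer(random_state=0),
}

# RMSE between imputed and true values, computed on the hidden entries only
scores = {}
for name, imputer in imputers.items():
    X_imputed = imputer.fit_transform(X_missing)
    scores[name] = float(np.sqrt(np.mean((X_imputed[mask] - X_true[mask]) ** 2)))

for name, rmse in scores.items():
    print(f'{name}: RMSE = {rmse:.3f}')
```

When the imputed data feeds a predictive model, an alternative is to compare the cross-validated score of the downstream model under each imputer instead of the reconstruction error.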

![](pic.png)