# ** Imputation of Null Values in Python- Machine Learning**
# ** Author** *Hafiza Anam Masood*  
# *Email: drhafizaanam@gmail.com*

Imputatipn: A better strategy is to impute the missing values, i.e., to infer them from the known part of the data.


Univariate vs. Multivariate Imputation
One type of imputation algorithm is univariate, which imputes values in the i-th feature dimension using only non-missing values in that feature dimension (e.g. impute.SimpleImputer). By contrast, multivariate imputation algorithms use the entire set of available feature dimensions to estimate the missing values (e.g. impute.IterativeImputer). 
See further: https://scikit-learn.org/stable/modules/impute.html

# Missing Values Imputation Method in Scikit-Learn Library
-1 SimpleImputer
 -2 KNNImputer
 -3 IterativeImputer
 -4 MissingIndicator (this shows to indicate the missing values)

# Simple Imputer

In [4]:
pip install scikit-learn 

Collecting scikit-learn
  Obtaining dependency information for scikit-learn from https://files.pythonhosted.org/packages/77/85/bff3a1e818ec6aa3dd466ff4f4b0a727db9fdb41f2e849747ad902ddbe95/scikit_learn-1.3.0-cp311-cp311-win_amd64.whl.metadata
  Using cached scikit_learn-1.3.0-cp311-cp311-win_amd64.whl.metadata (11 kB)
Collecting joblib>=1.1.1 (from scikit-learn)
  Obtaining dependency information for joblib>=1.1.1 from https://files.pythonhosted.org/packages/28/08/9dcdaa5aac4634e4c23af26d92121f7ce445c630efa0d3037881ae2407fb/joblib-1.3.1-py3-none-any.whl.metadata
  Using cached joblib-1.3.1-py3-none-any.whl.metadata (5.4 kB)
Collecting threadpoolctl>=2.0.0 (from scikit-learn)
  Obtaining dependency information for threadpoolctl>=2.0.0 from https://files.pythonhosted.org/packages/81/12/fd4dea011af9d69e1cad05c75f3f7202cdcbeac9b712eea58ca779a72865/threadpoolctl-3.2.0-py3-none-any.whl.metadata
  Using cached threadpoolctl-3.2.0-py3-none-any.whl.metadata (10.0 kB)
Using cached scikit_learn-1.

In [16]:
# Import libraries

import pandas as pd
import seaborn as sns
from sklearn.impute import SimpleImputer 
    

In [14]:
# Load the Titanic dataset

titanic_data = sns.load_dataset("titanic")

# Select the relevant Columns containing missing values

columns_with_nulls = ["age", "fare"]

# Create a new DataFrame with the selected columns
data = titanic_data[columns_with_nulls].copy()

# Print the number of missing values in the selected columns
print("Missing Values in the data:\n", data.isnull().sum())




Missing Values in the data:
 age     177
fare      0
dtype: int64


In [15]:
# Method 1 : SimpleImputer wih mean strategy
mean_imputer = SimpleImputer(strategy= 'mean')
mean_imputed_data = mean_imputer.fit_transform(data)


In [10]:
# Method 2: SimpleImputer with median Strategy
median_imputer = SimpleImputer(strategy= 'median')
median_imputed_data = median_imputer.fit_transform(data)

In [20]:
# Method 3: SimpleImputer with mode Strategy
mode_imputer = SimpleImputer(strategy= 'mode')
most_frequent_imputed_data = most_frequent_imputer.fit_transform(data)

In [22]:
# Replace null Values in the original DataFrame with new Columns

# Run the cells that define mean_imputed_data, median_imputed_data, and mode_imputed_data first
titanic_data[['age_mean', 'fare_mean']] = mean_imputed_data
titanic_data[['age_median', 'fare_median']] = median_imputed_data
titanic_data[['age_most_frequent', 'fare_most_frequent']] = most_frequent_imputed_data

# Print a Separation Line

print("====================================================================================\n")

# print the missing values

print("Missing values in the imputed_data:\n", titanic_data.isnull().sum()) 

# print the modified DataFrame
print(titanic_data.head())



Missing values in the imputed_data:
 survived                0
pclass                  0
sex                     0
age                   177
sibsp                   0
parch                   0
fare                    0
embarked                2
class                   0
who                     0
adult_male              0
deck                  688
embark_town             2
alive                   0
alone                   0
age_mean                0
fare_mean               0
age_median              0
fare_median             0
age_most_frequent       0
fare_most_frequent      0
dtype: int64
   survived  pclass     sex   age  sibsp  parch     fare embarked  class  \
0         0       3    male  22.0      1      0   7.2500        S  Third   
1         1       1  female  38.0      1      0  71.2833        C  First   
2         1       3  female  26.0      0      0   7.9250        S  Third   
3         1       1  female  35.0      1      0  53.1000        S  First   
4         0       3    

# *Iterative_Imputer*

In [24]:
import pandas as pd
import seaborn as sns
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

In [35]:
# Load the Titanic dataset

titanic_data = sns.load_dataset("titanic")

# Select the relevant Columns containing missing values

columns_with_nulls = ["age", "fare"]

# Create a new DataFrame with the selected columns
data = titanic_data[columns_with_nulls].copy()

# Iterative Imputer

Iterative_Imputer = IterativeImputer()
Iterative_Imputed_data = Iterative_Imputer.fit_transform(data)

# Replace null Values in the original DataFrame with new Columns

titanic_data[['age_iterative', 'fare_iterative']] = Iterative_Imputed_data

# Print a Separation Line

print("====================================================================================\n")

# print the missing values

print("Missing values in the imputed_data:\n", titanic_data.isnull().sum())






Missing values in the imputed_data:
 survived            0
pclass              0
sex                 0
age               177
sibsp               0
parch               0
fare                0
embarked            2
class               0
who                 0
adult_male          0
deck              688
embark_town         2
alive               0
alone               0
age_iterative       0
fare_iterative      0
dtype: int64


 # KNN_Imputer


In [26]:
import pandas as pd
import seaborn as sns
from sklearn.impute import KNNImputer


In [29]:
# Load the Titanic dataset

titanic_data = sns.load_dataset("titanic")

# Select the relevant Columns containing missing values

columns_with_nulls = ["age", "fare"]

# Create a new DataFrame with the selected columns
data = titanic_data[columns_with_nulls].copy()

#KNN Imputer

knn_imputer = KNNImputer(n_neighbors=5)

# Impute missing values using KNNImputer
knn_imputed_data = knn_imputer.fit_transform(data)

# Replace null Values in the original DataFrame with new Columns

# Run the null values in the original DataFrame with new columns
titanic_data[['age_knn', 'fare_knn']] = knn_imputed_data
titanic_data[['age_knn', 'fare_knn']] = knn_imputed_data

# Print a Separation Line

print("====================================================================================\n")

# print the missing values

print("Missing values in the imputed_data:\n", titanic_data.isnull().sum())


Missing values in the imputed_data:
 survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
age_knn          0
fare_knn         0
dtype: int64


# MissingIndicator

In [30]:
import pandas as pd
import seaborn as sns
from sklearn.impute import MissingIndicator

In [32]:
# Load the Titanic dataset

titanic_data = sns.load_dataset("titanic")

# Select the relevant Columns containing missing values

columns_with_nulls = ["age", "fare"]

# Create a new DataFrame with the selected columns
data = titanic_data[columns_with_nulls].copy()

#Misssing value indicators

indicator = MissingIndicator(features="all")
missing_indicators = indicator.fit_transform(data)



# Add missing value indicators to the DataFrame
titanic_data [['age_missing', 'fare_missing']] = missing_indicators


# Print a Separation Line

print("====================================================================================\n")

# print the missing values

print("Missing values in the imputed_data:\n", titanic_data.isnull().sum())

print(titanic_data.head())


Missing values in the imputed_data:
 survived          0
pclass            0
sex               0
age             177
sibsp             0
parch             0
fare              0
embarked          2
class             0
who               0
adult_male        0
deck            688
embark_town       2
alive             0
alone             0
age_missing       0
fare_missing      0
dtype: int64
   survived  pclass     sex   age  sibsp  parch     fare embarked  class  \
0         0       3    male  22.0      1      0   7.2500        S  Third   
1         1       1  female  38.0      1      0  71.2833        C  First   
2         1       3  female  26.0      0      0   7.9250        S  Third   
3         1       1  female  35.0      1      0  53.1000        S  First   
4         0       3    male  35.0      0      0   8.0500        S  Third   

     who  adult_male deck  embark_town alive  alone  age_missing  fare_missing  
0    man        True  NaN  Southampton    no  False        False       

# All together

In [33]:
# Import libraries

import pandas as pd
import seaborn as sns
from sklearn.impute import SimpleImputer , KNNImputer, IterativeImputer
from sklearn.impute import MissingIndicator



# Read the data
# Load the Titanic dataset

titanic_data = sns.load_dataset("titanic")

# Select the relevant Columns containing missing values

columns_with_nulls = ["age", "fare"]

# Create a new DataFrame with the selected columns
data = titanic_data[columns_with_nulls].copy()


# Method 1 : SimpleImputer wih mean strategy
mean_imputer = SimpleImputer(strategy= 'mean')
mean_imputed_data = mean_imputer.fit_transform(data)


# Method 2: SimpleImputer with median Strategy
median_imputer = SimpleImputer(strategy= 'median')
median_imputed_data = median_imputer.fit_transform(data)


# Method 3: SimpleImputer with mode Strategy
mode_imputer = SimpleImputer(strategy= 'mode')
most_frequent_imputed_data = most_frequent_imputer.fit_transform(data)

# Method 4: KNNImputer
knn_imputer = KNNImputer(n_neighbors=5)

# Impute missing values using KNNImputer
knn_imputed_data = knn_imputer.fit_transform(data)


#Misssing value indicators

indicator = MissingIndicator(features="all")
missing_indicators = indicator.fit_transform(data)

# Replace null values in the Original Data Frame

# Replace null Values in the original DataFrame with new Columns

# Run the cells that define mean_imputed_data, median_imputed_data, and mode_imputed_data first
titanic_data[['age_mean', 'fare_mean']] = mean_imputed_data
titanic_data[['age_median', 'fare_median']] = median_imputed_data
titanic_data[['age_most_frequent', 'fare_most_frequent']] = most_frequent_imputed_data
titanic_data[['age_knn', 'fare_knn']] = knn_imputed_data
titanic_data [['age_missing', 'fare_missing']] = missing_indicators

# Print a Separation Line

print("====================================================================================\n")

# print the missing values

print("Missing values in the imputed_data:\n", titanic_data.isnull().sum()) 

# print the modified DataFrame
print(titanic_data.head())



    


Missing values in the imputed_data:
 survived                0
pclass                  0
sex                     0
age                   177
sibsp                   0
parch                   0
fare                    0
embarked                2
class                   0
who                     0
adult_male              0
deck                  688
embark_town             2
alive                   0
alone                   0
age_mean                0
fare_mean               0
age_median              0
fare_median             0
age_most_frequent       0
fare_most_frequent      0
age_knn                 0
fare_knn                0
age_missing             0
fare_missing            0
dtype: int64
   survived  pclass     sex   age  sibsp  parch     fare embarked  class  \
0         0       3    male  22.0      1      0   7.2500        S  Third   
1         1       1  female  38.0      1      0  71.2833        C  First   
2         1       3  female  26.0      0      0   7.9250        S  Thir

In [34]:
titanic_data.describe()

Unnamed: 0,survived,pclass,age,sibsp,parch,fare,age_mean,fare_mean,age_median,fare_median,age_most_frequent,fare_most_frequent,age_knn,fare_knn
count,891.0,891.0,714.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208,29.699118,32.204208,29.361582,32.204208,29.699118,32.204208,29.638278,32.204208
std,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429,13.002015,49.693429,13.019697,49.693429,13.002015,49.693429,13.357812,49.693429
min,0.0,1.0,0.42,0.0,0.0,0.0,0.42,0.0,0.42,0.0,0.42,0.0,0.42,0.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104,22.0,7.9104,22.0,7.9104,22.0,7.9104,22.0,7.9104
50%,0.0,3.0,28.0,0.0,0.0,14.4542,29.699118,14.4542,28.0,14.4542,29.699118,14.4542,28.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.0,35.0,31.0,35.0,31.0,35.0,31.0,36.4,31.0
max,1.0,3.0,80.0,8.0,6.0,512.3292,80.0,512.3292,80.0,512.3292,80.0,512.3292,80.0,512.3292
