Five Important Ways to Impute Missing Values
You can impute missing values using machine learning models. This process is known as data imputation and is commonly used in data preprocessing to handle missing or incomplete data. There are several methods and models you can use, depending on the nature of your data and the missing values:  

**Simple Imputation Techniques:**  

**Mean/Median Imputation:** Replace missing values with the mean or median of the column. Suitable for numerical data.  
**Mode Imputation:** Replace missing values with the mode (most frequent value) of the column. Useful for categorical data.  
**K-Nearest Neighbors (KNN):** This algorithm can be used to impute missing values based on the similarity of rows.  

**Regression Imputation:** Use a regression model to predict the missing values based on other variables in your dataset.
  
**Decision Trees and Random Forests**: Some tree-based implementations can handle missing values natively. Tree models can also be used to predict missing values based on the patterns learned from the other data.  

**Advanced Techniques:**  

**Multiple Imputation by Chained Equations (MICE):** This is a more sophisticated technique that models each variable with missing values as a function of other variables in a round-robin fashion.  
**Deep Learning Methods:** Neural networks, especially autoencoders, can be effective in imputing missing values in complex datasets.  
**Time-Series-Specific Methods**: For time-series data, you might use techniques like interpolation, forward-fill, or backward-fill.  

It's important to choose the right method based on the type of data, the pattern of missingness (e.g., at random, completely at random, or not at random), and the amount of missing data. Additionally, it's crucial to understand that imputation can introduce bias or affect the distribution of your data, so it should be done with caution and an understanding of the potential implications.
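
Before picking a method, it can help to quantify how much is missing and to see how a simple fill changes a column's spread. The following is a minimal sketch on a small hypothetical DataFrame (`toy` and its values are made up for illustration and are not taken from the Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan, 58.0, 41.0],
    "fare": [7.25, 71.28, 8.05, 53.10, 26.55, 13.00],
})

# Share of missing values per column (0.0 = complete, 1.0 = entirely missing)
missing_share = toy.isna().mean()
print(missing_share)

# Median imputation adds identical values in the middle of the column,
# which shrinks its standard deviation
std_before = toy["age"].std()
std_after = toy["age"].fillna(toy["age"].median()).std()
print(f"std before: {std_before:.2f}, std after: {std_after:.2f}")
```

A shrinking standard deviation is one concrete way in which imputation can affect the distribution of the data.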

**2. K-Nearest Neighbors (KNN)**
KNN is a machine learning algorithm that can be used for imputing missing values. It works by finding the most similar data points to the one with the missing value based on other available features. The missing value is then imputed with the mean or median of the most similar data points.
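
To make the neighbor-averaging idea concrete, here is a small sketch on a toy array (the values are illustrative only). With `n_neighbors=2`, each missing entry is replaced by the mean of that column over the two closest rows, where closeness is measured on the features both rows have in common:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data: two entries are missing
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

# Row 0, column 2: nearest rows are 1 and 2, so the fill is (3 + 5) / 2
# Row 2, column 0: nearest rows are 1 and 3, so the fill is (3 + 8) / 2
print(X_filled)
```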

Let's see how to implement KNN imputation in Python using the Titanic dataset.

In [106]:
#import the library 
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.impute import KNNImputer

In [107]:
#Load the dataset
data=sns.load_dataset("titanic")

#See the data 
data.sample()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
324,0,3,male,,8,2,69.55,S,Third,man,True,,Southampton,no,False


# Mean/Median/Mode Imputation

In [108]:
# make the copy dataset
simple_imputation =data.copy()
# Check the missing values
simple_imputation.isnull().sum().sort_values(ascending=False)

deck           688
age            177
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

In [109]:
import warnings
warnings.filterwarnings("ignore")


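# Median imputation for the numerical column
simple_imputation['age'] = simple_imputation['age'].fillna(simple_imputation['age'].median())

# Mode imputation for the categorical columns
# (assign the result back instead of using inplace=True on a selected column)
simple_imputation['embarked'] = simple_imputation['embarked'].fillna(simple_imputation['embarked'].mode()[0])
simple_imputation['embark_town'] = simple_imputation['embark_town'].fillna(simple_imputation['embark_town'].mode()[0])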

In [110]:
simple_imputation.isnull().sum()

survived         0
pclass           0
sex              0
age              0
sibsp            0
parch            0
fare             0
embarked         0
class            0
who              0
adult_male       0
deck           688
embark_town      0
alive            0
alone            0
dtype: int64
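
The same median and mode fills can also be expressed with scikit-learn's `SimpleImputer`, which stores the fill value learned from one dataset so it can be reused on another. This is a sketch under the assumption that you have separate training and new data; `train` and `new` are hypothetical toy frames, not part of the notebook above:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frames, for illustration only
train = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 30.0],
    "embarked": ["S", "C", "S", np.nan],
})
new = pd.DataFrame({
    "age": [np.nan, 33.0],
    "embarked": [np.nan, "C"],
})

# Learn the fill values from the training data only
age_imputer = SimpleImputer(strategy="median").fit(train[["age"]])
port_imputer = SimpleImputer(strategy="most_frequent").fit(train[["embarked"]])

# Apply the learned fill values to the new data
new_filled = new.copy()
new_filled["age"] = age_imputer.transform(new[["age"]]).ravel()
new_filled["embarked"] = port_imputer.transform(new[["embarked"]]).ravel()
print(new_filled)
```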

# Forward Fill and Backward Fill
**Description: Fills missing values based on nearby values in a time series or ordered data.**

In [111]:
# # Forward Fill (fillna(method='ffill') is deprecated in recent pandas versions)
# df['column'] = df['column'].ffill()

# # Backward Fill
# df['column'] = df['column'].bfill()
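
As a runnable illustration of these ordered-data fills, the sketch below applies forward fill, backward fill, and linear interpolation to a small hypothetical daily series (the dates and values are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with a leading gap and an interior gap
s = pd.Series(
    [np.nan, 1.0, np.nan, np.nan, 4.0],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

forward = s.ffill()       # carry the last seen value forward
backward = s.bfill()      # pull the next seen value backward
linear = s.interpolate()  # straight line between the surrounding values

print(pd.DataFrame({"original": s, "ffill": forward, "bfill": backward, "linear": linear}))
```

Forward fill and default linear interpolation both leave the leading gap empty because there is no earlier value to start from, while backward fill covers it.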


# K-Nearest Neighbors (KNN) Imputation
**Description:**   
Uses the K-nearest neighbors algorithm to impute missing values by finding the 'k' most similar instances based on other features.


In [112]:
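#Make the copy of dataset
df_imputed = data.copy()

# impute missing values with KNN imputer
from sklearn.impute import KNNImputer

# call the KNN class with number of neighbors = 4
imputer = KNNImputer(n_neighbors=4)

# KNN needs other features to measure similarity between rows. With only the
# "age" column there is nothing to compare, and the imputer falls back to the column mean.
# Note: distances are scale-sensitive, so consider scaling these columns first.
num_cols = ["age", "fare", "sibsp", "parch", "pclass"]
df_imputed[num_cols] = imputer.fit_transform(df_imputed[num_cols])

# For categorical columns, encode them as numbers first, then apply the imputer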
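
One possible way to follow the note on categorical data is sketched below on a hypothetical toy frame: map each category to an integer code (keeping `NaN` for missing entries), impute, then map the codes back. Using `n_neighbors=1` is an assumption made here so that the imputed code is always an existing category rather than an average of codes:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "fare": [7.25, 71.28, 8.05, 80.00, 7.90],
    "pclass": [3, 1, 3, 1, 3],
    "embarked": ["S", "C", "S", np.nan, "S"],
})

# Encode categories as integer codes; missing entries stay NaN
categories = sorted(toy["embarked"].dropna().unique())
code_of = {cat: i for i, cat in enumerate(categories)}
encoded = toy.copy()
encoded["embarked"] = toy["embarked"].map(code_of)

# With one neighbor, the fill is the code of the single most similar row
filled = KNNImputer(n_neighbors=1).fit_transform(encoded)

# Decode the integer codes back to the original labels
codes = filled[:, 2].round().astype(int)
toy_imputed = toy.copy()
toy_imputed["embarked"] = [categories[c] for c in codes]
print(toy_imputed)
```

In this toy frame the row with the missing port most resembles the other first-class, high-fare passenger, so it receives that passenger's port.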

In [113]:
# check the number of missing values in each column

df_imputed.isnull().sum()

survived         0
pclass           0
sex              0
age              0
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

# Multivariate Imputation by Chained Equations (MICE)
**Description: Imputes missing values iteratively by modeling each variable with missing values as a function of other variables.**  

**Regression Imputation**  
Regression imputation uses a regression model to predict the missing values based on other variables in the dataset. It applies directly to numerical data; categorical variables must be encoded as numbers first, or predicted with a classification model instead.
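
The idea can be sketched on a toy array in which the second column is roughly twice the first (values are illustrative only). `IterativeImputer` learns that relationship from the rows where both values are present and uses it to fill the gaps:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # enables the experimental API
from sklearn.impute import IterativeImputer

# Toy training data: column 1 is about 2 * column 0
X_train = np.array([
    [1.0, 2.0],
    [3.0, 6.0],
    [4.0, 8.0],
    [np.nan, 3.0],
    [7.0, np.nan],
])

imputer = IterativeImputer(max_iter=10, random_state=0)
imputer.fit(X_train)

# New rows with gaps; each gap is predicted from the other column
X_new = np.array([
    [np.nan, 2.0],
    [6.0, np.nan],
    [np.nan, 6.0],
])
X_new_filled = imputer.transform(X_new)
print(np.round(X_new_filled))
```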

In [114]:
df1=sns.load_dataset("titanic")

In [115]:
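# impute missing values with regression imputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# call the IterativeImputer class with max_iter = 10
imputer = IterativeImputer(max_iter=10, random_state=0)

# impute missing values with regression imputer
# Several numeric columns are passed so that age can be regressed on the others;
# with the "age" column alone there are no predictors and the fill is just the column mean.
num_cols = ['age', 'fare', 'sibsp', 'parch', 'pclass']
df1[num_cols] = imputer.fit_transform(df1[num_cols])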

# check the number of missing values in each column
df1.isnull().sum().sort_values(ascending=False)


deck           688
embarked         2
embark_town      2
survived         0
pclass           0
sex              0
age              0
sibsp            0
parch            0
fare             0
class            0
who              0
adult_male       0
alive            0
alone            0
dtype: int64

**5.2. Deep Learning Methods**
Deep learning methods, particularly neural networks such as autoencoders, offer a powerful approach for imputing missing values in complex datasets. They are especially useful when the data has intricate, non-linear relationships that traditional statistical methods might not capture effectively.

**Understanding Autoencoders for Imputation:**

**What is an Autoencoder?**

An autoencoder is a type of neural network that is trained to copy its input to its output.  
It has a hidden layer that describes a code used to represent the input.  
The network may be viewed as consisting of two parts: an encoder function, which compresses the input into a latent-space representation, and a decoder function, which reconstructs the input from the latent space.  

**How Autoencoders Work for Imputation:**

The key idea is to train the autoencoder to ignore the noise (missing values) in the input data.  
During training, inputs with missing values are presented, and the network learns to predict the missing values in a way that minimizes reconstruction error for known parts of the data.  
This results in the network learning a robust representation of the data, enabling it to make reasonable guesses about missing values.  

**Advantages of Using Autoencoders:**

**Handling Complex Patterns:** They can capture non-linear relationships in the data, which is particularly useful for complex datasets.  
**Scalability:** They can handle large-scale datasets efficiently.  
**Flexibility:** They can be adapted to different types of data (e.g., images, text, time-series).  

**Implementation Considerations:**

**Data Preprocessing:** Data should be normalized or standardized before feeding it into an autoencoder.  
**Network Architecture:** The choice of architecture (number of layers, type of layers, etc.) depends on the complexity of the data.  
**Training Process:** It might involve techniques like dropout or noise addition to improve the model's ability to handle missing data.  

**Example Use-Cases:**

**Image Data:** Filling in missing pixels or reconstructing corrupted images.  
**Time-Series Data:** Imputing missing values in sequences like stock prices or weather data.  
**Tabular Data**: Handling missing entries in datasets used for machine learning.  
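
Two pieces of the imputation idea described above can be sketched in plain NumPy, independently of any particular network: a reconstruction error that only counts the known entries, and a fill step that keeps observed values and takes the reconstruction only where data is missing. The function names `masked_mse` and `fill_missing`, and the reconstruction `x_hat`, are hypothetical and used purely for illustration:

```python
import numpy as np

def masked_mse(x_true, x_pred, observed):
    """Mean squared error computed only over observed (non-missing) entries."""
    observed = observed.astype(bool)
    # Zero out unobserved positions so NaNs never enter the arithmetic
    diff = np.where(observed, x_true, 0.0) - np.where(observed, x_pred, 0.0)
    return (diff ** 2).sum() / observed.sum()

def fill_missing(x_original, x_reconstructed):
    """Keep observed values; use the reconstruction only where data is missing."""
    return np.where(np.isnan(x_original), x_reconstructed, x_original)

# Toy scaled data with one missing entry
x = np.array([[0.2, np.nan],
              [0.8, 0.5]])
observed = ~np.isnan(x)

# Hypothetical output of a trained autoencoder for the same rows
x_hat = np.array([[0.3, 0.4],
                  [0.8, 0.7]])

loss = masked_mse(x, x_hat, observed)
x_filled = fill_missing(x, x_hat)
print(loss)
print(x_filled)
```

Keeping the observed values untouched means the network's guesses only affect the entries that were actually missing.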

In [116]:

import seaborn as sns
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Load the Titanic dataset
df_titanic = sns.load_dataset('titanic')

# Selecting relevant features for simplicity
df_titanic = df_titanic[['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']]

# Preprocessing
# Separate features and target
X = df_titanic.drop('survived', axis=1)
y = df_titanic['survived']

# Handling missing values and categorical variables
numeric_features = ['age', 'fare', 'sibsp', 'parch']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', MinMaxScaler())])

categorical_features = ['pclass', 'sex', 'embarked']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# ColumnTransformer for preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Preprocessing the dataset
X_preprocessed = preprocessor.fit_transform(X)

# Splitting the dataset (we'll use the train set to train the autoencoder)
X_train, X_test, y_train, y_test = train_test_split(X_preprocessed, y, test_size=0.2, random_state=42)

# Define the autoencoder architecture
input_dim = X_train.shape[1]
encoding_dim = 32

input_layer = Input(shape=(input_dim,))
encoded = Dense(encoding_dim, activation='relu')(input_layer)
decoded = Dense(input_dim, activation='sigmoid')(encoded)

autoencoder = Model(input_layer, decoded)
autoencoder.compile(optimizer='adam', loss='mean_squared_error')

# Train the autoencoder
autoencoder.fit(X_train, X_train, epochs=50, batch_size=64, shuffle=True, validation_split=0.2)

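# Reconstruct the test set with the trained autoencoder.
# Note: X_test was already filled by the SimpleImputer steps above, so this call
# reconstructs every entry rather than filling only genuinely missing ones.
X_test_imputed = autoencoder.predict(X_test)

# Note: Transforming imputed data back to the original feature space requires reversing the preprocessing steps.
# This is often not straightforward, especially for one-hot encoded features.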

Epoch 1/50
9/9 ━━━━━━━━━━━━━━━━━━━━ 1s 20ms/step - loss: 0.2235 - val_loss: 0.2156
Epoch 2/50
9/9 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - loss: 0.2111 - val_loss: 0.2027
Epoch 3/50
9/9 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - loss: 0.1978 - val_loss: 0.1897
Epoch 4/50
9/9 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - loss: 0.1847 - val_loss: 0.1767
Epoch 5/50
9/9 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - loss: 0.1722 - val_loss: 0.1635
Epoch 6/50
9/9 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 0.1598 - val_loss: 0.1507
Epoch 7/50
9/9 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - loss: 0.1471 - val_loss: 0.1383
Epoch 8/50
9/9 ━━━━━━━━━━━━━━━━━━━━ 0s 15ms/step - loss: 0.1347 - val_loss: 0.1268
Epoch 9/50
[1m9/9[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[