<a href="https://colab.research.google.com/github/Kiran-Pokhrel-91/Data/blob/main/Imputing_missing_values.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Six Important Ways of Imputing Missing Values <br>
Data imputation replaces missing or incomplete values with estimates, ranging from simple summary statistics to machine learning models, and is a common step in data preprocessing. Some methods are as follows: <br>
1. **Simple Imputation Techniques**
    - Mean / Median Imputation
    - Mode Imputation
2. **K-Nearest Neighbors (KNN)**
3. **Regression Imputation**
4. **Decision Tree and Random Forest**
5. **Advanced Techniques:**
    - Multiple Imputation by Chained Equations (MICE)
    - Deep Learning Methods
6. **Time-Series Specific Methods** <br>

It is important to choose the right method based on the type of data, the pattern of missingness, and the amount of missing data.
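Before picking a method, it helps to quantify how much is missing in each column. The sketch below uses a hypothetical helper, `missing_report`, on a small made-up DataFrame; neither is part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_report(df):
    """Return the count and percentage of missing values per column (hypothetical helper)."""
    report = pd.DataFrame({
        "n_missing": df.isna().sum(),
        "pct_missing": df.isna().mean() * 100,
    })
    # Columns with the most missing data come first
    return report.sort_values("pct_missing", ascending=False)

# Toy data, made up for illustration
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "fare": [7.25, 71.28, 8.05, 53.10],
    "deck": [np.nan, "C", np.nan, np.nan],
})

report = missing_report(toy)
print(report)
```

A column that is mostly empty (like `deck` here) is often a candidate for dropping, while a column with only a few gaps can usually be filled with a simple method.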

# 1. Simple Imputation: Mean, Median, and Mode Imputation

When working with real-world datasets, it's common to encounter missing values. These need to be handled appropriately to avoid skewing analysis or model performance.

## 1.1 Mean and Median Imputation

Mean and median imputation are numerical imputation strategies used to fill in missing values in continuous features.

- **Mean imputation** replaces missing values with the average of the available values.
- **Median imputation** replaces missing values with the median (middle value) of the available values.

These methods are simple and often effective, but they can:
- Underestimate the variability (variance) in the dataset.
- Introduce bias when values are **not missing completely at random** (that is, when missingness depends on other variables or on the missing value itself).
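The loss of variability can be seen on a small made-up series. This is only an illustrative sketch: the numbers are invented, and the outlier (100) is there to show how differently the mean and the median behave.

```python
import numpy as np
import pandas as pd

# Toy values with two gaps and one outlier (made up for illustration)
values = pd.Series([10.0, 12.0, np.nan, 14.0, 100.0, np.nan, 11.0])

# Fill the gaps with the mean and with the median of the observed values
mean_filled = values.fillna(values.mean())
median_filled = values.fillna(values.median())

# pandas skips NaN when computing std, so the first line is the observed spread
print("observed std:     ", values.std())
print("mean-filled std:  ", mean_filled.std())
print("median-filled std:", median_filled.std())
```

Both filled series have a smaller standard deviation than the observed values, because every imputed point sits at a single central value. The median is less affected by the outlier than the mean, which is why it is usually preferred for skewed features.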


## 1.2 Mode Imputation

Mode imputation is typically used for categorical features. It replaces missing values with the most frequent (mode) value in the column.

The example below fills `age` with its mean and the `embarked` and `embark_town` columns with their modes; `deck` is dropped because most of its values are missing.

**Remember:** These are basic imputation strategies. For more robust handling of missing data, consider advanced techniques like:
- K-Nearest Neighbors (KNN) imputation
- Multivariate imputation (e.g., MICE)
- Using machine learning models to predict missing values


In [None]:
import pandas as pd
import seaborn as sns

data = sns.load_dataset('titanic')
print(data.isnull().sum().sort_values(ascending=False))

deck           688
age            177
embarked         2
embark_town      2
sex              0
pclass           0
survived         0
fare             0
parch            0
sibsp            0
class            0
adult_male       0
who              0
alive            0
alone            0
dtype: int64


In [None]:
data = (
    data.assign(
        age = data['age'].fillna(data['age'].mean()),
        embarked = data['embarked'].fillna(data['embarked'].mode()[0]),
        embark_town = data['embark_town'].fillna(data['embark_town'].mode()[0])
    )
    .drop(columns='deck')
)


In [None]:
print(data.isnull().sum().sort_values(ascending=False))

survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
embark_town    0
alive          0
alone          0
dtype: int64
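When the fill values must be learned on one dataset and reused on another (for example, train and test splits), scikit-learn's `SimpleImputer` offers the same strategies with a `fit`/`transform` interface. The sketch below uses small made-up `train` and `test` frames rather than the Titanic data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy train/test frames, made up for illustration
train = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 54.0],
    "embarked": pd.Series(["S", "C", np.nan, "S"], dtype=object),
})
test = pd.DataFrame({
    "age": [np.nan, 40.0],
    "embarked": pd.Series([np.nan, "Q"], dtype=object),
})

# Median for the numeric column, most frequent value for the categorical one
num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")

# Learn the fill values from the training data only
num_imputer.fit(train[["age"]])
cat_imputer.fit(train[["embarked"]])

# Apply the learned fill values to the test data
test_filled = test.copy()
test_filled["age"] = num_imputer.transform(test[["age"]]).ravel()
test_filled["embarked"] = cat_imputer.transform(test[["embarked"]]).ravel()
print(test_filled)
```

Fitting on the training split alone keeps information from the test split out of the fill values.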


# 2. KNN Imputation for Handling Missing Values

---

While simple methods like mean, median, and mode imputation are useful, they ignore the relationships between features, so they perform poorly when missingness depends on other variables or when the features are strongly related.

To address this, we can use a **machine learning approach**: **K-Nearest Neighbors (KNN) imputation**.

---

## What is KNN Imputation?

KNN imputation finds the most similar observations (neighbors) to the sample with the missing value, based on other available features. The missing value is then imputed with:

- The **mean** (for continuous variables)
- Or **mode** (for categorical variables) of the neighbors

KNN can preserve more complex patterns in the data compared to simpler methods.
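A tiny worked example makes the neighbor averaging concrete. The array below is made up; with `n_neighbors=2`, the gap in the third row is filled from the two rows closest to it in the first column.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data: the second column is missing in the third row
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, np.nan],
    [10.0, 100.0],
])

# Distances are computed on the features both rows have observed.
# Row 2 (first column = 3.0) is closest to rows 1 and 0, so the gap
# is filled with the mean of their second-column values: (20 + 10) / 2.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled[2, 1])  # 15.0
```

The distant fourth row does not contribute, which is how KNN keeps imputed values consistent with similar observations.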

---

## Summary

KNN Imputation is a powerful alternative to simple imputation methods, especially when:

- Relationships between features are nonlinear
- You want to preserve multivariate structure
- Data has missing values in more than one column

Next, you could explore **MICE (Multivariate Imputation by Chained Equations)** for even more sophisticated imputation!



In [None]:
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
import pandas as pd

def knn_impute(df, target_col, numeric_cols, n_neighbors=5):
    """
    Imputes missing values in the specified target column using KNN imputation.

    Parameters:
    - df (pd.DataFrame): Input DataFrame.
    - target_col (str): Column to impute (e.g., 'age').
    - numeric_cols (list): List of numeric columns to use for imputation.
    - n_neighbors (int): Number of neighbors to use for KNN imputer (default is 5).

    Returns:
    - df (pd.DataFrame): DataFrame with imputed target column.
    """

    # Copy the DataFrame to avoid modifying the original
    df_copy = df.copy()

    # Extract subset of data to scale and impute
    df_knn = df_copy[numeric_cols]

    # Step 1: Scale the features so no single column dominates the distance
    # (StandardScaler ignores NaNs when fitting and keeps them in the output)
    scaler = StandardScaler()
    df_scaled = pd.DataFrame(scaler.fit_transform(df_knn), columns=df_knn.columns, index=df_knn.index)

    # Step 2: Apply KNN Imputer
    imputer = KNNImputer(n_neighbors=n_neighbors)
    df_imputed_scaled = pd.DataFrame(imputer.fit_transform(df_scaled), columns=df_knn.columns, index=df_knn.index)

    # Step 3: Inverse scale to original values
    df_imputed = pd.DataFrame(scaler.inverse_transform(df_imputed_scaled), columns=df_knn.columns, index=df_knn.index)

    # Step 4: Replace original column with imputed values
    # (keeping the original index ensures rows align even for a non-default index)
    df_copy[target_col] = df_imputed[target_col]

    return df_copy


In [None]:
import seaborn as sns

# Load Titanic dataset
df = sns.load_dataset('titanic')
df.drop(columns='deck', inplace=True)

# Impute 'age' using KNN with relevant numeric features
numeric_features = ['age', 'fare', 'pclass', 'sibsp', 'parch']
df = knn_impute(df, target_col='age', numeric_cols=numeric_features)

# Check missing values
print(df.isnull().sum())
df.head()


survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       2
class          0
who            0
adult_male     0
embark_town    2
alive          0
alone          0
dtype: int64


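| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | Southampton | no | True |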


# 3. Regression Imputation: A Simple Machine Learning Approach for Missing Data

---

## What is Regression Imputation?

Regression imputation uses a **regression model** to predict missing values based on other observed features in the dataset.

- For **numerical variables**, a regression model (e.g., linear regression) is trained on samples with observed values.  
- The trained model then **predicts missing values** by leveraging correlations with other variables.

This method works well when there is a strong relationship between the variable with missing values and other features.
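The mechanics can be seen on a tiny made-up example in which `y` depends linearly on `x`. This is only a sketch on invented data: the model is fit on the rows where `y` is observed and then predicts the rows where it is missing.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data following y = 2x + 1, with two values of y missing
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [3.0, 5.0, np.nan, 9.0, np.nan],
})

# Fit on the rows where y is observed
known = toy["y"].notna()
model = LinearRegression().fit(toy.loc[known, ["x"]], toy.loc[known, "y"])

# Predict y for the rows where it is missing and write the predictions back
toy.loc[~known, "y"] = model.predict(toy.loc[~known, ["x"]])
print(toy)
```

Every imputed value lies exactly on the fitted line, so, like mean imputation, plain regression imputation tends to understate the natural scatter in the data.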

---

## Why Use Regression Imputation?

Regression imputation can:  
- Capture linear or even nonlinear relationships between variables (if you use nonlinear regression).  
- Preserve more information than simple imputation, leading to more realistic and accurate imputations.  
- Remain reasonable when data are **not missing completely at random**, provided the missingness can be explained by the observed features (MAR).

---

## Summary

Regression imputation is a powerful alternative to mean/median imputation, especially when:  
- There are meaningful dependencies between variables.  
- You want to leverage available information to make more accurate imputations.  
- Missing data occurs in important continuous variables.



In [None]:
import pandas as pd
from sklearn.linear_model import LinearRegression

def linear_regression_impute_age(df, features, encode_col='sex'):
    """
    Impute missing 'age' values using Linear Regression based on selected features.

    Parameters:
    - df: pandas DataFrame (should include 'age' and the selected features)
    - features: list of column names to use as predictors
    - encode_col: column name to manually encode (default is 'sex')

    Returns:
    - df with imputed 'age' values (note: `encode_col` stays encoded as 0/1 in the returned copy)
    """

    # Copy to avoid modifying original DataFrame
    df = df.copy()

    # Encode 'sex' or any categorical column manually (for simplicity)
    if encode_col and df[encode_col].dtype == 'object':
        df[encode_col] = df[encode_col].map({'male': 0, 'female': 1})

    # Split the dataset into known and unknown age
    df_known_age = df[df['age'].notnull()]
    df_missing_age = df[df['age'].isnull()]

    # Prepare training and testing data
    X_train = df_known_age[features]
    y_train = df_known_age['age']
    X_test = df_missing_age[features]

    # Train linear regression model
    model = LinearRegression()
    model.fit(X_train, y_train)

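    # Predict and fill missing ages (skipped when nothing is missing,
    # because predict() raises an error on an empty input)
    if not X_test.empty:
        df.loc[df['age'].isnull(), 'age'] = model.predict(X_test)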

    return df



In [None]:
import seaborn as sns

# Load data
df = sns.load_dataset('titanic')

# Drop 'deck' due to too many missing values
df.drop(columns='deck', inplace=True)

# Select features for age prediction
selected_features = ['pclass', 'sex', 'sibsp', 'parch', 'fare']

# Apply the imputation function
df = linear_regression_impute_age(df, features=selected_features)

# Final check
print(df.isnull().sum())
df.head()

survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       2
class          0
who            0
adult_male     0
embark_town    2
alive          0
alone          0
dtype: int64


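| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 0 | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | Southampton | no | False |
| 1 | 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | Cherbourg | yes | False |
| 2 | 1 | 3 | 1 | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | Southampton | yes | True |
| 3 | 1 | 1 | 1 | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | Southampton | yes | False |
| 4 | 0 | 3 | 0 | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | Southampton | no | True |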


# 4. Random Forests for Imputing Missing Values

---

## What is Random Forest Imputation?

**Random Forest Imputation** uses an ensemble of decision trees (a random forest) to predict and fill in missing values in a dataset.

- For **numerical variables**, the random forest regressor is used.
- For **categorical variables**, the random forest classifier is used.
- The model is trained using the rows where the target feature is **not missing**, and it predicts values for rows where it **is missing**.

Unlike simpler methods, random forests can capture **nonlinear relationships and interactions** between features.
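For a categorical target the same idea applies with a classifier instead of a regressor. The sketch below uses a small made-up DataFrame with a hypothetical `port` column; it is not the notebook's Titanic workflow.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy data: 'port' is categorical and missing in the last two rows
toy = pd.DataFrame({
    "fare": [7.0, 8.0, 7.5, 80.0, 90.0, 85.0, 8.2, 88.0],
    "pclass": [3, 3, 3, 1, 1, 1, 3, 1],
    "port": pd.Series(["S", "S", "S", "C", "C", "C", np.nan, np.nan], dtype=object),
})
features = ["fare", "pclass"]

# Train on the rows where the target is observed
known = toy["port"].notna()
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(toy.loc[known, features], toy.loc[known, "port"])

# Predict the category for the rows where it is missing
toy.loc[~known, "port"] = clf.predict(toy.loc[~known, features])
print(toy)
```

The classifier can only return categories it saw during training, so imputed values always stay within the observed set of labels.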

---

## Why Use Random Forests for Imputation?

Random Forest imputation is **robust and powerful** because it:

- Handles **nonlinear dependencies** between variables.
- Can model **interactions** between features automatically.
- Works for both **numerical** and **categorical** data.
- Can achieve **higher accuracy** than mean/median or regression imputation in many cases.
- Is **non-parametric** — no assumption of linearity or distribution of data.

---

## Limitations

- **Computationally intensive**: Especially for large datasets or with many missing values.
- **May overfit**: If individual trees are grown very deep on small or noisy data. Adding more trees does not by itself cause overfitting; it mainly increases computation time.
- **Black-box model**: Random forests are less interpretable compared to simpler models like linear regression.
- **Slower training**: Compared to simpler imputation methods like mean/median imputation.

---

## Summary

Random Forest Imputation is a **flexible and accurate** method for handling missing values. It is especially useful when:

- There are **complex or nonlinear relationships** among features.
- You need to impute both **numerical and categorical** data.
- You want to **leverage patterns** in the data instead of applying generic values.

Although it is more computationally expensive, it often yields **better performance** than traditional imputation techniques.




In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelEncoder

def random_forest_impute(df, target_col, label_encode_cols):
    """
    Imputes missing values in a target column using a Random Forest Regressor.

    Parameters:
    - df (pd.DataFrame): The input DataFrame containing missing values.
    - target_col (str): The name of the column to impute (e.g., 'age').
    - label_encode_cols (list): List of categorical columns to encode before modeling.

    Returns:
    - df (pd.DataFrame): The DataFrame with missing values in the target column imputed,
                         and categorical columns restored to their original form.
    """

    df = df.copy()  # Avoid in-place modification

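    # Encode categorical variables.
    # Caveat: astype(str) can turn NaN into the literal label 'nan' (it did for the
    # output shown below), so missing categories are treated as their own level and
    # come back as the string 'nan' after inverse_transform; isnull() no longer flags them.
    encoders = {}
    for col in label_encode_cols:
        le = LabelEncoder()
        df[col] = le.fit_transform(df[col].astype(str))
        encoders[col] = le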

    # Split the data
    df_missing = df[df[target_col].isna()]
    df_not_missing = df[df[target_col].notna()]

    # Ensure we have data to predict
    if df_missing.empty:
        print(f"No missing values in '{target_col}' to impute.")
        # Inverse-transform and return
        for col in label_encode_cols:
            df[col] = encoders[col].inverse_transform(df[col])
        return df

    # Separate features and target
    X = df_not_missing.drop([target_col], axis=1)
    y = df_not_missing[target_col]

    # Align test data (drop same columns)
    X_missing = df_missing.drop([target_col], axis=1)

    # Train the model
    rf = RandomForestRegressor(n_estimators=100, random_state=42)
    rf.fit(X, y)

    # Predict and fill missing values
    predicted_values = rf.predict(X_missing)
    df.loc[df[target_col].isna(), target_col] = predicted_values

    # Inverse transform label-encoded columns
    for col in label_encode_cols:
        df[col] = encoders[col].inverse_transform(df[col])

    return df


In [None]:
import seaborn as sns

# Load data
df = sns.load_dataset('titanic')
df.drop(columns='deck', inplace=True)

# Apply the improved function
df = random_forest_impute(df, 'age', ['sex', 'embarked', 'who', 'class', 'embark_town', 'alive'])

# Check result
print("Missing values:\n", df.isnull().sum())
df.head()


Missing values:
 survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
embark_town    0
alive          0
alone          0
dtype: int64


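| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | Southampton | no | True |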


# 5. Advanced Techniques

## 5.1 Multiple Imputation by Chained Equations (MICE)

---

## What is MICE?

**Multiple Imputation by Chained Equations (MICE)** is a powerful and flexible technique for handling missing data.

- It models each variable with missing values **as a function of other variables** in the dataset.
- This modeling and imputation process is repeated in a **round-robin** fashion.
- MICE can handle **both numerical and categorical** variables.
- It creates **multiple imputed datasets**, which can be used for robust statistical inference.

> MICE assumes that the data are **missing at random (MAR)** — the probability of missingness can be predicted using observed data.

---

## Why Use MICE?

MICE is widely used because it:

- Accounts for **uncertainty** in missing data by generating multiple possible values.
- Produces more **reliable and valid estimates** than single imputation techniques.
- Preserves **relationships among variables** better than simple methods.
- Works well in **complex datasets** with various types of features.
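The "multiple possible values" idea can be sketched with scikit-learn's `IterativeImputer`. By default it returns a single imputed dataset; setting `sample_posterior=True` and varying `random_state` is one way to draw several. The data below are synthetic, and the choice of five imputations is an arbitrary assumption for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Synthetic data: the third column depends on the first two
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X[:, 2] = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=50)

# Knock out ten values in the third column
X_missing = X.copy()
X_missing[rng.choice(50, size=10, replace=False), 2] = np.nan

# Draw five imputed datasets, each with a different random seed
imputed_sets = []
for seed in range(5):
    imputer = IterativeImputer(sample_posterior=True, max_iter=10, random_state=seed)
    imputed_sets.append(imputer.fit_transform(X_missing))

# Analyze each dataset separately, then pool: here, the mean of column 2
stacked = np.stack(imputed_sets)
per_set_means = stacked[:, :, 2].mean(axis=1)
pooled_mean = per_set_means.mean()
between_var = per_set_means.var(ddof=1)
print(pooled_mean, between_var)
```

The spread of the per-dataset estimates (`between_var`) reflects the uncertainty introduced by the missing values, which a single imputation cannot express.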

---

## Limitations

- **Computationally intensive**: The iterative nature of MICE can be slow for large datasets or many variables.
- **Assumes data are Missing at Random (MAR)**: If the data are Missing Not at Random (MNAR), the imputations may be biased.
- **Sensitive to model selection**: The imputation quality depends heavily on the estimator used (e.g., BayesianRidge, RandomForest).
- **More complex to implement and explain**: Compared to simpler techniques like mean or median imputation.

---

## Summary

Multiple Imputation by Chained Equations (MICE) is a **powerful and flexible** method for handling missing data. It:

- Models each variable with missing values based on all other variables.
- Uses an **iterative round-robin** approach for imputation.
- Supports both **numerical and categorical** features.
- Produces more **statistically sound** results by accounting for uncertainty in the imputed values.
- Is well-suited for **complex datasets** where preserving relationships between variables is important.

Use MICE when accuracy and robustness are more important than simplicity or speed.




In [None]:
def preprocess_data_with_mice(df, categorical_cols=None, impute_cols=None, max_iter=10):
    """
    Preprocesses any DataFrame by encoding categorical columns, imputing missing values
    using IterativeImputer (MICE), and decoding categorical columns back to original labels.

    Note: with default settings IterativeImputer returns a single imputed dataset, and
    label-encoded categories are treated as ordered numbers, which is only a rough approximation.

    Parameters:
        df (pd.DataFrame): Input dataset.
        categorical_cols (list): List of categorical column names to encode and decode.
        impute_cols (list): List of column names to impute missing values.
        max_iter (int): Maximum number of iterations for IterativeImputer.

    Returns:
        pd.DataFrame: Cleaned and imputed dataset.
    """
    import pandas as pd
    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa
    from sklearn.impute import IterativeImputer
    from sklearn.preprocessing import LabelEncoder

    # Copy to avoid modifying original data
    df = df.copy()

    # Initialize label encoders
    label_encoders = {}

    # Encode categorical columns, keeping NaN in place so the imputer can see it.
    # (Casting the whole column to str first can turn NaN into the label 'nan',
    # which hides the missing values from IterativeImputer.)
    if categorical_cols:
        for col in categorical_cols:
            le = LabelEncoder()
            observed = df[col].notna()
            codes = pd.Series(np.nan, index=df.index, dtype=float)
            codes.loc[observed] = le.fit_transform(df.loc[observed, col].astype(str))
            df[col] = codes
            label_encoders[col] = le

    # Impute missing values using IterativeImputer
    if impute_cols:
        imputer = IterativeImputer(max_iter=max_iter, random_state=0)
        df[impute_cols] = imputer.fit_transform(df[impute_cols])

        # Round imputed category codes and clip them to the range of known labels
        for col in impute_cols:
            if categorical_cols and col in categorical_cols:
                n_labels = len(label_encoders[col].classes_)
                df[col] = df[col].round().clip(0, n_labels - 1)

    # Inverse transform to get original labels back (rows still missing stay NaN)
    if categorical_cols:
        for col in categorical_cols:
            le = label_encoders[col]
            observed = df[col].notna()
            decoded = pd.Series(np.nan, index=df.index, dtype=object)
            decoded.loc[observed] = le.inverse_transform(df.loc[observed, col].astype(int))
            df[col] = decoded

    return df


In [None]:
import seaborn as sns
titanic = sns.load_dataset('titanic')

# Define columns
categorical_columns = ['sex', 'embarked', 'who', 'deck', 'class', 'embark_town', 'alive']
columns_to_impute = ['age', 'embarked', 'embark_town', 'deck']

# Preprocess using universal function
clean_df = preprocess_data_with_mice(
    df=titanic,
    categorical_cols=categorical_columns,
    impute_cols=columns_to_impute
)

# Check result
clean_df.head()
clean_df.isnull().sum()


survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0


# 5.2 Deep Learning: coming soon...