<a href="https://colab.research.google.com/github/Kiran-Pokhrel-91/Data/blob/main/Imputing_missing_values.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Six Important Ways of Imputing Missing Values
Data imputation is the process of filling in missing or incomplete values, and it is a common step in data preprocessing. The replacement values can come from simple statistics or from machine learning models. Some common methods are:
1. **Simple Imputation Techniques**
    - Mean / Median Imputation
    - Mode Imputation
2. **K-Nearest Neighbors (KNN)**
3. **Regression Imputation**
4. **Decision Trees and Random Forests**
5. **Advanced Techniques:**
    - Multiple Imputation by Chained Equations (MICE)
    - Deep Learning Methods
6. **Time-Series-Specific Methods**

It is important to choose the right method based on the type of data, the pattern of missingness, and the amount of missing data.

# 1. Simple Imputation: Mean, Median, and Mode Imputation

When working with real-world datasets, it's common to encounter missing values. These need to be handled appropriately to avoid skewing analysis or model performance.

## 1.1 Mean and Median Imputation

Mean and median imputation are numerical imputation strategies used to fill in missing values in continuous features.

- **Mean imputation** replaces missing values with the average of the available values.
- **Median imputation** replaces missing values with the median (middle value) of the available values.

These methods are simple and often effective, but they can:
- Underestimate the variability (variance) in the dataset.
- Introduce bias if data are **not missing at random (NMAR)**.
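
The variance point can be checked on a tiny example. This is a minimal sketch on made-up values (the `ages` series is hypothetical, not taken from the Titanic data): filling with the mean keeps the mean unchanged but shrinks the variance, while the median is less affected by the single large value.

```python
import numpy as np
import pandas as pd

# Toy, right-skewed sample with two missing entries (made-up values).
ages = pd.Series([22.0, 25.0, 27.0, 30.0, np.nan, 31.0, 80.0, np.nan])

# pandas skips NaN when computing the mean and median of the observed values.
mean_filled = ages.fillna(ages.mean())
median_filled = ages.fillna(ages.median())

print("mean used:           ", ages.mean())
print("median used:         ", ages.median())
print("variance before:     ", ages.var())
print("variance after mean: ", mean_filled.var())
```

Every filled row sits exactly at the mean, so the spread of the column can only go down.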


## 1.2 Mode Imputation

Mode imputation is typically used for categorical features. It replaces missing values with the most frequent (mode) value in the column.

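The example below fills `age` with its mean and the categorical `embarked` and `embark_town` columns with their modes. The `deck` column is dropped because most of its values are missing.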
      
**Remember:** These are basic imputation strategies. For more robust handling of missing data, consider advanced techniques like:
- K-Nearest Neighbors (KNN) imputation
- Multivariate imputation (e.g., MICE)
- Using machine learning models to predict missing values


In [None]:
import pandas as pd
import seaborn as sns

data = sns.load_dataset('titanic')
print(data.isnull().sum().sort_values(ascending=False))

deck           688
age            177
embarked         2
embark_town      2
sex              0
pclass           0
survived         0
fare             0
parch            0
sibsp            0
class            0
adult_male       0
who              0
alive            0
alone            0
dtype: int64


In [None]:
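# Fill numeric `age` with its mean and the categorical columns with their mode,
# then drop `deck`, which is missing for most rows.
data = (
    data.assign(
        age=data['age'].fillna(data['age'].mean()),
        embarked=data['embarked'].fillna(data['embarked'].mode()[0]),
        embark_town=data['embark_town'].fillna(data['embark_town'].mode()[0]),
    )
    .drop(columns='deck')
)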


In [None]:
print(data.isnull().sum().sort_values(ascending=False))

survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
embark_town    0
alive          0
alone          0
dtype: int64
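
The same strategies are also available through scikit-learn's `SimpleImputer`. The sketch below is one possible alternative to the `fillna` calls above, shown on small made-up arrays (`age` and `port` are hypothetical stand-ins, not the Titanic columns). It may be convenient when the fitted statistic needs to be reused on new rows.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up numeric and categorical columns with missing entries.
age = np.array([[22.0], [np.nan], [30.0], [41.0]])
port = np.array([["S"], ["C"], [np.nan], ["S"]], dtype=object)

# Median for the numeric column, most frequent value for the categorical one.
age_imputer = SimpleImputer(strategy="median")
port_imputer = SimpleImputer(strategy="most_frequent")

age_filled = age_imputer.fit_transform(age)
port_filled = port_imputer.fit_transform(port)

# The fitted statistic is stored, so transform() can fill new rows with it.
new_age = age_imputer.transform(np.array([[np.nan]]))

print(age_filled.ravel())
print(port_filled.ravel())
print(new_age.ravel())
```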


# 2. KNN Imputation for Handling Missing Values

---

While simple methods like mean, median, and mode imputation are useful, they have limitations, particularly when data are **not missing at random** or when features have complex relationships.

To address this, we can use a **machine learning approach**: **K-Nearest Neighbors (KNN) imputation**.

---

## What is KNN Imputation?

KNN imputation finds the most similar observations (neighbors) to the sample with the missing value, based on other available features. The missing value is then imputed with:

- The **mean** (for continuous variables)
- Or **mode** (for categorical variables) of the neighbors

KNN can preserve more complex patterns in the data compared to simpler methods.
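
To make the neighbor averaging concrete, here is a small sketch on a made-up 4x3 array (the values are illustrative only). With `n_neighbors=2`, each missing entry is replaced by the mean of that feature over the two closest rows that have it observed; distances are computed on the features both rows share.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix with two missing entries (made-up values).
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

# Row 0, column 2: nearest rows are 1 and 2, so the fill is (3 + 5) / 2 = 4.0
# Row 2, column 0: nearest rows are 1 and 3, so the fill is (3 + 8) / 2 = 5.5
print(X_filled)
```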

---

## Summary

KNN Imputation is a powerful alternative to simple imputation methods, especially when:

- Relationships between features are nonlinear
- You want to preserve multivariate structure
- Data has missing values in more than one column

Next, you could explore **MICE (Multivariate Imputation by Chained Equations)** for even more sophisticated imputation!



In [None]:
# Import required libraries
import pandas as pd
import seaborn as sns
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Load Titanic dataset
df = sns.load_dataset('titanic')

# Select relevant features (numeric only for KNN)
df_knn = df[['age', 'fare', 'pclass', 'sibsp', 'parch']].copy()
df_knn.isnull().sum()

age       177
fare        0
pclass      0
sibsp       0
parch       0
dtype: int64


In [None]:
# Scale the features.
# KNN is distance-based, so it is sensitive to feature scales; StandardScaler
# puts all features on a comparable scale. NaNs are ignored when fitting and
# kept as NaN in the transformed output.
scaler = StandardScaler()
df_knn_scaled = pd.DataFrame(scaler.fit_transform(df_knn), columns=df_knn.columns)

# Initialize and apply KNN imputer
imputer = KNNImputer(n_neighbors=5)
df_knn_imputed = pd.DataFrame(imputer.fit_transform(df_knn_scaled), columns=df_knn.columns)

# Revert to original scale
df_knn_final = pd.DataFrame(scaler.inverse_transform(df_knn_imputed), columns=df_knn.columns)

df['age'] = df_knn_final['age']
df.isnull().sum()

survived    0
pclass      0
sex         0
age         0
sibsp       0
parch       0
fare        0
embarked    2
class       0
who         0


# 3. Regression Imputation: A Simple Machine Learning Approach for Missing Data

---

## What is Regression Imputation?

Regression imputation uses a **regression model** to predict missing values based on other observed features in the dataset.

- For **numerical variables**, a regression model (e.g., linear regression) is trained on samples with observed values.  
- The trained model then **predicts missing values** by leveraging correlations with other variables.

This method works well when there is a strong relationship between the variable with missing values and other features.
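
The idea can be sketched on a toy table before applying it to real data. The example assumes a made-up, exactly linear relationship `y = 2x + 1` (the `toy` frame and its column names are hypothetical), so the imputed values can be checked by hand.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data where y depends exactly linearly on x: y = 2x + 1.
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})
toy["y"] = 2 * toy["x"] + 1
toy.loc[[1, 4], "y"] = np.nan  # hide the values at x = 2 and x = 5

# Train on rows where y is observed, then predict y for the rows where it is missing.
observed = toy["y"].notna()
model = LinearRegression().fit(toy.loc[observed, ["x"]], toy.loc[observed, "y"])
toy.loc[~observed, "y"] = model.predict(toy.loc[~observed, ["x"]])

print(toy)
```

Because the relationship is exact, the model recovers the hidden values (5 and 11 up to floating-point error). On real data the predictions fall on the fitted line, so imputed rows show less scatter than observed ones.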

---

## Why Use Regression Imputation?

Regression imputation can:  
- Capture linear or even nonlinear relationships between variables (if you use nonlinear regression).  
- Preserve more information than simple imputation, leading to more realistic and accurate imputations.  
- Remain appropriate when data are **not missing completely at random**, provided the missingness is explained by the observed features used as predictors.

---

## Summary

Regression imputation is a powerful alternative to mean/median imputation, especially when:  
- There are meaningful dependencies between variables.  
- You want to leverage available information to make more accurate imputations.  
- Missing data occurs in important continuous variables.



In [14]:
# Import required libraries
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression

# Load Titanic dataset
df = sns.load_dataset('titanic')

df = df.drop(columns='deck')

df.isnull().sum()

survived      0
pclass        0
sex           0
age         177
sibsp         0
parch         0
fare          0
embarked      2
class         0
who           0


In [15]:
# Encode the 'sex' column as numeric
df['sex'] = df['sex'].map({'male': 0, 'female': 1})

# Select features to use in regression
features = ['pclass', 'sex', 'sibsp', 'parch', 'fare']

# Split into rows with and without age
df_known_age = df[df['age'].notnull()]
df_missing_age = df[df['age'].isnull()]

# Define training data
X_train = df_known_age[features]
y_train = df_known_age['age']

# Define test data (where age is missing)
X_test = df_missing_age[features]

# Initialize and train the model
reg = LinearRegression()
reg.fit(X_train, y_train)

# Predict missing age values
predicted_ages = reg.predict(X_test)

# Replace missing age values with predictions
df.loc[df['age'].isnull(), 'age'] = predicted_ages

df.isnull().sum()

survived    0
pclass      0
sex         0
age         0
sibsp       0
parch       0
fare        0
embarked    2
class       0
who         0


# 4. Random Forests for Imputing Missing Values

---

## What is Random Forest Imputation?

**Random Forest Imputation** uses an ensemble of decision trees (a random forest) to predict and fill in missing values in a dataset.

- For **numerical variables**, the random forest regressor is used.
- For **categorical variables**, the random forest classifier is used.
- The model is trained using the rows where the target feature is **not missing**, and it predicts values for rows where it **is missing**.

Unlike simpler methods, random forests can capture **nonlinear relationships and interactions** between features.
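
A minimal sketch of the train-on-observed, predict-on-missing pattern for a numerical column is shown here. It uses made-up data with a nonlinear relationship `y = x**2` (the `toy` frame, its column names, and the hyperparameters are illustrative assumptions, not a tuned setup) and compares the worst-case error against filling with the column mean.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Made-up nonlinear data: y = x**2 on an evenly spaced grid.
x = np.linspace(-3, 3, 200)
toy = pd.DataFrame({"x": x, "y": x ** 2})
true_y = toy["y"].copy()
toy.loc[toy.index[5::10], "y"] = np.nan  # hide every 10th value

# Train on rows where y is observed.
missing = toy["y"].isna()
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(toy.loc[~missing, ["x"]], toy.loc[~missing, "y"])

# Compare worst-case error of forest predictions vs. plain mean imputation.
rf_pred = rf.predict(toy.loc[missing, ["x"]])
rf_error = np.abs(rf_pred - true_y[missing].to_numpy()).max()
mean_error = np.abs(toy["y"].mean() - true_y[missing].to_numpy()).max()

# Fill the missing entries with the forest predictions.
toy.loc[missing, "y"] = rf_pred

print("max error, random forest:", rf_error)
print("max error, mean fill:    ", mean_error)
```

On this toy curve the forest follows the shape of the data, while the mean fill assigns the same value everywhere.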

---

## Why Use Random Forests for Imputation?

Random Forest imputation is **robust and powerful** because it:

- Handles **nonlinear dependencies** between variables.
- Can model **interactions** between features automatically.
- Works for both **numerical** and **categorical** data.
- Can achieve **higher accuracy** than mean/median or regression imputation in many cases.
- Is **non-parametric**: it makes no assumption of linearity or about the distribution of the data.
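
For a categorical column, the same pattern applies with a classifier. The sketch below assumes a made-up label `group` that depends only on whether `x` is positive (the `toy` frame and all names in it are hypothetical); the hidden labels sit away from the class boundary so the result is easy to check.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up data: the label is "high" when x > 0 and "low" otherwise.
x = np.linspace(-3, 3, 120)
toy = pd.DataFrame({"x": x, "group": np.where(x > 0, "high", "low")})
true_group = toy["group"].copy()
toy.loc[toy.index[3::12], "group"] = np.nan  # hide every 12th label

# Train a classifier on rows with an observed label, predict the rest.
missing = toy["group"].isna()
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(toy.loc[~missing, ["x"]], toy.loc[~missing, "group"])
toy.loc[missing, "group"] = clf.predict(toy.loc[missing, ["x"]])

print((toy["group"] == true_group).mean())  # share of labels matching the truth
```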

---

## Limitations

- **Computationally intensive**: Especially for large datasets or with many missing values.
- **May overfit**: If individual trees are grown very deep on small or noisy data. Adding more trees does not by itself cause overfitting, but it does increase computation.
- **Black-box model**: Random forests are less interpretable compared to simpler models like linear regression.
- **Slower training**: Compared to simpler imputation methods like mean/median imputation.

---

## Summary

Random Forest Imputation is a **flexible and accurate** method for handling missing values. It is especially useful when:

- There are **complex or nonlinear relationships** among features.
- You need to impute both **numerical and categorical** data.
- You want to **leverage patterns** in the data instead of applying generic values.

Although it is more computationally expensive, it often yields **better performance** than traditional imputation techniques.


