# What is Data Imputation

Data imputation is the process of handling missing (null) values in a dataset so that model training is more efficient and accurate. It works by replacing each missing value with an estimated or predicted value. Missing values occur for various reasons, including missing information, data entry errors, data deletion, and other inconsistencies. Handling them is essential for reducing bias in predictions and for keeping the data compatible with models that cannot accept missing values. In this article, we will see what data imputation is and which techniques are used to perform it in machine learning.

# Identifying Null Values
Null values can be identified with pre-defined functions in the pandas library. Two of these methods are listed below:

- isna() detects the missing values in the cells of a pandas data frame. It returns a data frame of the same dimensions as the dataset, with True for null values and False for non-null values.

- isnull() is an alias of isna(), so it returns the same output and can be used interchangeably.

In [1]:
import pandas as pd
import numpy as np

# Creating a dataset for demonstration 
d = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, np.nan, 30, 22, np.nan],
    'Salary': [50000, 60000, np.nan, 45000, 52000],
    'Department': ['HR', 'IT', np.nan, 'Finance', 'HR']
}
df = pd.DataFrame(d)

# Checking presence of null values
df.isnull()

| | Name | Age | Salary | Department |
|---|---|---|---|---|
| 0 | False | False | False | False |
| 1 | False | True | False | False |
| 2 | False | False | True | True |
| 3 | False | False | False | False |
| 4 | False | True | False | False |
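The boolean mask is rarely inspected cell by cell. A minimal sketch of how it is usually summarized follows; the name `df_demo` is introduced here for illustration and simply rebuilds the demonstration data so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Rebuild the demonstration data (df_demo is an illustrative name)
df_demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, np.nan, 30, 22, np.nan],
    'Salary': [50000, 60000, np.nan, 45000, 52000],
    'Department': ['HR', 'IT', np.nan, 'Finance', 'HR'],
})

# Number of missing values in each column
null_counts = df_demo.isna().sum()
print(null_counts)

# Rows that contain at least one missing value
rows_with_nulls = df_demo[df_demo.isna().any(axis=1)]
print(rows_with_nulls)
```

Counting nulls per column first is a quick way to decide which columns need imputation at all and which technique might suit each of them.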


# How Data Imputation Works

Data imputation is a statistical approach used in data pre-processing to replace missing, null, or incomplete values in a dataset with estimated, predicted, or aggregated values derived from the corresponding feature or attribute. This step makes the dataset complete and consistent so that it can be used for model training.

# Techniques for Data Imputation

Data imputation can be performed in several ways to bring the data into a complete and consistent form. Here's an overview of some common approaches:

1. Mean, Median, Mode Imputation
Mean, median, and mode are used to fill null values in numerical and categorical variables. These are the simplest and most commonly used approaches.

- Mean Imputation: Replaces the missing value with the average of the observed values of the feature. Best for roughly normally distributed numerical data; less accurate for skewed data.
- Median Imputation: Replaces the missing value with the median of the observed values of the feature. Works well for skewed numerical data and is more robust to outliers than mean imputation.
- Mode Imputation: Replaces the missing value with the most frequent (mode) value. Typically used to impute categorical features.

**Observations:**
- Missing `Age` values were replaced with the column mean (≈25.7).
- Missing `Salary` was replaced with the column median (51,000).
- Missing `Department` was replaced with the mode (`HR`).
- The imputed dataset now has no nulls.

**Interpretation and findings:**
- Mean imputation keeps the overall scale but can bias skewed distributions.
- Median imputation is robust to outliers and better for skewed numeric data.
- Mode imputation preserves the most frequent category but may over-represent it.
- These simple methods are fast and easy but can underestimate variance and weaken relationships among features.

In [2]:
# Initialize new instance for dataset
df2 = df.copy()

# Using Mean to impute the null values
df2['Age'] = df2['Age'].fillna(df2['Age'].mean())

# Using Median to impute the null values
df2['Salary'] = df2['Salary'].fillna(df2['Salary'].median())

# Using mode to impute null values
df2['Department'] = df2['Department'].fillna(df2['Department'].mode()[0])

df2

| | Name | Age | Salary | Department |
|---|---|---|---|---|
| 0 | Alice | 25.0 | 50000.0 | HR |
| 1 | Bob | 25.666667 | 60000.0 | IT |
| 2 | Charlie | 30.0 | 51000.0 | HR |
| 3 | David | 22.0 | 45000.0 | Finance |
| 4 | Eva | 25.666667 | 52000.0 | HR |
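The same three strategies are also available through scikit-learn's `SimpleImputer`. This is useful when the fill values must be learned on training data and reused on new data. The following is a minimal sketch, assuming scikit-learn is installed; `df_si` and the imputer variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Illustrative copy of the numeric and categorical columns used above
df_si = pd.DataFrame({
    'Age': [25, np.nan, 30, 22, np.nan],
    'Salary': [50000, 60000, np.nan, 45000, 52000],
    'Department': ['HR', 'IT', np.nan, 'Finance', 'HR'],
})

# One imputer per strategy; each learns its fill value in fit()
mean_imputer = SimpleImputer(strategy='mean')
median_imputer = SimpleImputer(strategy='median')
mode_imputer = SimpleImputer(strategy='most_frequent')

# fit_transform returns a 2-D array, so flatten it back to one column
df_si['Age'] = mean_imputer.fit_transform(df_si[['Age']]).ravel()
df_si['Salary'] = median_imputer.fit_transform(df_si[['Salary']]).ravel()
df_si['Department'] = mode_imputer.fit_transform(df_si[['Department']]).ravel()

print(df_si)
```

In a real pipeline, one would typically call `fit` on the training split only and `transform` on the test split, so that statistics from the test data do not leak into training.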


2. Forward Fill, Backward Fill Imputation
Forward fill and backward fill are imputation techniques that reuse the nearest observed data point.

- Forward fill replaces a missing value with the last known non-null value, using the "ffill" method in the pandas library. It is generally used in time series data.
- Backward fill replaces a missing value with the next known non-null value, using the "bfill" method in the pandas library.

**Observations:**
- Forward fill (ffill) replaced each null with the last seen non-null value above it.
- Backward fill (bfill) replaced each null with the next available non-null value below it.
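- After ffill, no nulls remain in the sample dataset because the first row is complete. After bfill, Eva's `Age` is still null because no later row exists to fill from.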

**Interpretation and findings:**
- Forward fill is suitable for time-ordered data where the last known value is a reasonable carry-over; it can propagate stale values if long gaps exist.
- Backward fill uses future observations, which may not be valid in strict time-causal settings but can work for static ordered data.
- Both methods are fast and simple but may distort trends and underrepresent variability when gaps are large or frequent.

In [3]:
# Initializing dataset
df_ff = df.copy()

# Imputing null values with previous known value
df_ff = df_ff.ffill()

df_ff


| | Name | Age | Salary | Department |
|---|---|---|---|---|
| 0 | Alice | 25.0 | 50000.0 | HR |
| 1 | Bob | 25.0 | 60000.0 | IT |
| 2 | Charlie | 30.0 | 60000.0 | IT |
| 3 | David | 22.0 | 45000.0 | Finance |
| 4 | Eva | 22.0 | 52000.0 | HR |


In [4]:
# Initializing dataset
df_bf = df.copy()

# Imputing null values with next known value
df_bf = df_bf.bfill()

df_bf


| | Name | Age | Salary | Department |
|---|---|---|---|---|
| 0 | Alice | 25.0 | 50000.0 | HR |
| 1 | Bob | 30.0 | 60000.0 | IT |
| 2 | Charlie | 30.0 | 45000.0 | Finance |
| 3 | David | 22.0 | 45000.0 | Finance |
| 4 | Eva | NaN | 52000.0 | HR |
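Two practical refinements are worth sketching: limiting how far a value may be carried, and chaining both directions so that gaps at the start or end of the data are also filled. The series below contain hypothetical values chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with a long gap (illustrative values)
readings = pd.Series(
    [20.0, np.nan, np.nan, np.nan, 23.0, np.nan],
    index=pd.date_range('2024-01-01', periods=6, freq='D'),
)

# Carry the last known value forward over at most 2 consecutive gaps,
# so very stale values are not propagated indefinitely
limited = readings.ffill(limit=2)
print(limited)

# Chain ffill and bfill so a leading gap (which ffill cannot fill)
# is filled from the next known value
s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])
filled = s.ffill().bfill()
print(filled)
```

With `limit=2`, the third missing reading in the long gap stays null, which makes the remaining gap visible instead of hiding it behind a carried-over value.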


3. Regression Imputation

- This approach predicts missing values using regression-based models.
- Linear or non-linear regression can be used.
- It is relevant for numeric data.
- It can give more accurate estimates than mean or median imputation when the predictors are informative.
- It can overfit, especially on small samples, if not handled properly.

**Observations:**
- Trained a linear regression on rows with non-null `Salary` and `Age`.
- Predicted missing `Age` only where `Salary` was available.
- Rows with both `Salary` and `Age` non-null remained unchanged.

**Interpretation and findings:**
- Assumes a linear relationship between `Salary` and `Age`; may be weak with few rows.
- Cannot impute `Age` where `Salary` is also missing (those stay NaN).
- More data or additional predictors could reduce variance and improve estimates; regularization or cross-validation helps avoid overfitting on small samples.

In [5]:
from sklearn.linear_model import LinearRegression

# Work on a copy so the original DataFrame stays unchanged
df_reg = df.copy()
model = LinearRegression()

# Train only on rows where both 'Salary' and 'Age' are present
df_train = df_reg.dropna(subset=['Salary', 'Age'])
model.fit(df_train[['Salary']], df_train['Age'])

# Select rows where 'Age' is missing but 'Salary' is available
to_predict = df_reg.loc[df_reg['Age'].isnull() & df_reg['Salary'].notnull(), ['Salary']]

# Predict the missing 'Age' values and write them back
df_reg.loc[to_predict.index, 'Age'] = model.predict(to_predict)

df_reg

| | Name | Age | Salary | Department |
|---|---|---|---|---|
| 0 | Alice | 25.0 | 50000.0 | HR |
| 1 | Bob | 31.0 | 60000.0 | IT |
| 2 | Charlie | 30.0 | NaN | NaN |
| 3 | David | 22.0 | 45000.0 | Finance |
| 4 | Eva | 26.2 | 52000.0 | HR |


4. k-Nearest Neighbors Imputation

- This approach estimates missing data on the basis of other relevant features or variables.
- It uses the distance between samples as the measure of similarity.
- It analyzes the k nearest neighbors (data points) and the multivariate relationships between them.
- It is computationally expensive and therefore less efficient for large datasets.

The distance between two samples is typically the Euclidean distance:

$$d(i,j) = \sqrt{\sum_{k} \left(X_{i,k} - X_{j,k}\right)^2}$$

where $X_{i,k}$ and $X_{j,k}$ are the observed values of feature $k$ for samples $i$ and $j$.
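As a rough illustration, scikit-learn provides a `KNNImputer` that fills each missing value from the nearest rows, using a Euclidean distance that skips missing coordinates. The sketch below assumes scikit-learn is installed; `df_knn` is an illustrative name, `n_neighbors=2` is an arbitrary choice for this tiny sample, and only the numeric columns are used because the imputer works on numeric data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Numeric columns of the demonstration data (df_knn is an illustrative name)
df_knn = pd.DataFrame({
    'Age': [25, np.nan, 30, 22, np.nan],
    'Salary': [50000, 60000, np.nan, 45000, 52000],
})

# Each missing value is replaced by the average of that feature
# over the 2 nearest rows in which the feature is observed
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(
    imputer.fit_transform(df_knn),
    columns=df_knn.columns,
    index=df_knn.index,
)
print(imputed)
```

Because the method is distance-based, features on very different scales (such as `Age` and `Salary`) would normally be scaled before imputation so that one feature does not dominate the distance.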

5. Multiple Imputation

This technique generates several plausible values for each missing value, which yields several imputed datasets. Each imputed dataset is analyzed separately, and the results are then pooled into a single estimate.
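One way to sketch this idea is with scikit-learn's `IterativeImputer`, which can draw imputations from a predictive distribution when `sample_posterior=True`. This is only an approximation of a full multiple-imputation workflow, shown under stated assumptions: the data are synthetic, the number of imputations (5) is arbitrary, and the pooled quantity is simply the column mean.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Synthetic data: y depends linearly on x, then 40 values of y are removed
rng = np.random.default_rng(0)
x = rng.normal(50, 10, size=200)
y = 2.0 * x + rng.normal(0, 5, size=200)
data = pd.DataFrame({'x': x, 'y': y})
data.loc[rng.choice(200, size=40, replace=False), 'y'] = np.nan

# Create several imputed datasets, each with a different random seed,
# and analyze each one (here: estimate the mean of y)
estimates = []
for seed in range(5):
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    completed = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)
    estimates.append(completed['y'].mean())

# Pool the per-dataset results; the spread between them reflects
# the uncertainty introduced by the missing values
pooled_mean = float(np.mean(estimates))
between_variance = float(np.var(estimates, ddof=1))
print(pooled_mean, between_variance)
```

The between-dataset variance is the part that single imputation hides: it shows how much the estimate depends on the particular values that were filled in.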

# Applications of Data Imputation
Data imputation is a routine part of pre-processing and has a wide range of real-world applications. Some of these applications are as follows:

1. Finance and Banking Sector: Handling gaps in transaction and investment data
2. Healthcare Diagnosis: Completing missing details in patient records for improved analysis
3. Market Research Analysis: Completing customer responses to extract better insights
4. Government Sector: Filling gaps in survey responses and questionnaires
5. Pattern Recognition: Providing complete and consistent data, which facilitates better recognition of patterns

# Advantages of Data Imputation
Some of the key advantages of Data Imputation are:

1. Simple to implement and easy to understand
2. Ensures consistency and completeness in data
3. Can improve model performance and prediction accuracy
4. Helps models recognize patterns and capture relationships between attributes

# Disadvantages of Data Imputation
Some of the key disadvantages of Data Imputation are:

1. Simple imputation techniques fail to capture non-linear and complex relationships
2. Mean and median imputation can artificially reduce the variance of a feature
3. k-NN imputation is computationally expensive
4. Imputed values can introduce bias if the assumptions behind the method are incorrect
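The variance reduction caused by mean imputation can be checked directly. The sketch below uses synthetic, normally distributed data with 30% of the values removed at random; all names and numbers are illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic feature: 1000 values, of which 300 are removed at random
rng = np.random.default_rng(42)
values = pd.Series(rng.normal(loc=50, scale=10, size=1000))
with_gaps = values.copy()
with_gaps.iloc[rng.choice(1000, size=300, replace=False)] = np.nan

# Fill every gap with the mean of the observed values
mean_filled = with_gaps.fillna(with_gaps.mean())

# Standard deviation before and after imputation (NaNs are skipped by std())
observed_std = with_gaps.std()
imputed_std = mean_filled.std()
print(observed_std, imputed_std)
```

Every filled value sits exactly at the mean, so the imputed column is more tightly concentrated than the observed data and its standard deviation is smaller.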