In [1]:
import pandas as pd
import numpy as np

We will create a small dataset with missing values to see how data imputation works. It has two columns, age and salary, and each column has two missing entries (represented as np.nan).

In [2]:
data = {'age': [25, np.nan, 30, np.nan, 35],
        'salary': [50000, 60000, np.nan, 90000, np.nan]}

dataframe = pd.DataFrame(data)
print(dataframe)

    age   salary
0  25.0  50000.0
1   NaN  60000.0
2  30.0      NaN
3   NaN  90000.0
4  35.0      NaN

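Before imputing, it can help to confirm where the gaps are. The following is a minimal sketch that rebuilds the same DataFrame so it runs on its own, then counts missing entries per column with isna(). The names missing_counts and incomplete_rows are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Same toy data as above, rebuilt so this cell is self-contained
dataframe = pd.DataFrame({
    'age': [25, np.nan, 30, np.nan, 35],
    'salary': [50000, 60000, np.nan, 90000, np.nan],
})

# Number of missing entries in each column
missing_counts = dataframe.isna().sum()
print(missing_counts)

# Rows that have at least one missing value
incomplete_rows = dataframe[dataframe.isna().any(axis=1)]
print(incomplete_rows)
```

Only row 0 is fully observed here, which matters later for neighbor-based imputation.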

One simple way to handle missing data is to replace each missing value with the mean of the observed values in the same column. This is straightforward and works reasonably well when the column's values are roughly symmetric and free of strong outliers, since the mean is sensitive to extreme values.

In [3]:
from sklearn.impute import SimpleImputer

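# Replace each NaN with the mean of the observed values in its column
imputer = SimpleImputer(strategy='mean')
imputed_data = imputer.fit_transform(dataframe)
# fit_transform returns a NumPy array, so restore the column names
imputed_df = pd.DataFrame(imputed_data, columns=dataframe.columns)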
print(imputed_df)

    age        salary
0  25.0  50000.000000
1  30.0  60000.000000
2  30.0  66666.666667
3  30.0  90000.000000
4  35.0  66666.666667

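As a sanity check on the output above, the same fill can be reproduced with pandas alone, and the median can be swapped in where outliers are a concern. This is a sketch under the assumption that the toy data is unchanged. The names manual_df, median_imputer and median_df are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Same toy data as above, rebuilt so this cell is self-contained
dataframe = pd.DataFrame({
    'age': [25, np.nan, 30, np.nan, 35],
    'salary': [50000, 60000, np.nan, 90000, np.nan],
})

# Column means ignore NaN by default, so this mirrors strategy='mean'
manual_df = dataframe.fillna(dataframe.mean())
print(manual_df)

# The median is less affected by extreme values than the mean
median_imputer = SimpleImputer(strategy='median')
median_df = pd.DataFrame(median_imputer.fit_transform(dataframe),
                         columns=dataframe.columns)
print(median_df)
```

With this data the median fills salary with 60000.0 instead of 66666.67, because the single high salary of 90000 pulls the mean upward but not the median.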

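Imputation can also be done with the KNN (K-Nearest Neighbors) imputer. It estimates each missing value from the rows most similar to the incomplete row (its nearest neighbors), with similarity measured on the features both rows have observed. The filled-in value therefore depends on the rest of the row rather than on a single column-wide statistic.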

In [4]:
from sklearn.impute import KNNImputer

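# Fill each NaN from up to 2 nearest rows that have the feature observed
imputer = KNNImputer(n_neighbors=2)
imputed_data = imputer.fit_transform(dataframe)
# fit_transform returns a NumPy array, so restore the column names
imputed_df = pd.DataFrame(imputed_data, columns=dataframe.columns)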
print(imputed_df)

    age   salary
0  25.0  50000.0
1  25.0  60000.0
2  30.0  50000.0
3  25.0  90000.0
4  35.0  50000.0

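Every imputed value above equals the value in row 0 (age 25, salary 50000). The likely reason can be seen by inspecting the pairwise distances: scikit-learn exposes the NaN-aware Euclidean metric as nan_euclidean_distances, and a distance is NaN when two rows share no observed feature. The sketch below assumes that KNNImputer uses this metric (its default) and that the toy data is unchanged. The name distances is illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import nan_euclidean_distances

# Same toy data as above, rebuilt so this cell is self-contained
dataframe = pd.DataFrame({
    'age': [25, np.nan, 30, np.nan, 35],
    'salary': [50000, 60000, np.nan, 90000, np.nan],
})

# Pairwise distances computed only over features observed in both rows;
# pairs with no shared observed feature come back as NaN
distances = nan_euclidean_distances(dataframe)
print(pd.DataFrame(distances).round(1))
```

Rows 1 and 3 are missing age, and among the rows that do have an age (0, 2 and 4) only row 0 also has a salary to compare against, so row 0 is their only usable neighbor. The same holds for rows 2 and 4 when salary is imputed. With so few complete rows, n_neighbors=2 effectively falls back to a single neighbor here. A larger dataset with more overlapping features would give the imputer real neighbors to average.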