# Imputation Methods
by Wilmer Garzón, last updated: 30-June-2025

This notebook demonstrates various imputation methods for handling missing data. We will use a sample dataset with missing values and apply different techniques to fill in the gaps. Finally, we will compare the methods using Mean Squared Error (MSE).

In [None]:
import pandas as pd
import numpy as np
#Imports KNNImputer to fill missing values using nearest neighbors.
from sklearn.impute import KNNImputer
#Imports LinearRegression to build linear models
from sklearn.linear_model import LinearRegression
#Imports mean_squared_error to evaluate prediction accuracy.
from sklearn.metrics import mean_squared_error
#Imports train_test_split to divide data into training and test sets.
from sklearn.model_selection import train_test_split

## Dataset Description:
- **Number of samples**: 500
- **Number of features**: 6 numerical features
- **Variables**: Feature_1, Feature_2, ..., Feature_6
- **Target variable**: Target
- **Missing values**: None, the dataset is fully complete

In [None]:
# I didn't find the file, so I created one with the same features above
import pandas as pd
import numpy as np
# Load the dataset
url = "https://raw.github.com/Heitor-vn/Data-Mining---UDC/main/complete_dataset.csv"
data = pd.read_csv(url, sep=';')

# Convert decimal commas to decimal points and cast every column to float
for col in data.columns:
    data[col] = data[col].str.replace(',', '.').astype(float)

data.head()

Unnamed: 0,Feature_1,Feature_2,Feature_3,Feature_4,Feature_5,Feature_6,Target
0,42.472407,77.042858,63.919637,55.919509,29.361118,29.359671,308.674565
1,23.485017,71.970569,56.066901,62.484355,21.23507,78.194591,319.606561
2,69.946558,32.740347,30.909498,31.004271,38.254535,51.485386,261.176287
3,45.916701,37.473748,56.711174,28.369632,37.528679,41.981711,234.322087
4,47.364199,67.110558,31.980427,50.854066,55.544874,22.787025,287.760589


In [None]:
# Check for missing values
data.isnull().sum()

Unnamed: 0,0
Feature_1,0
Feature_2,0
Feature_3,0
Feature_4,0
Feature_5,0
Feature_6,0
Target,0


## Exercises:
- Please complete the following tasks step by step.
- Each numbered item corresponds to a specific imputation technique or evaluation method that you must implement and analyze.

**STEP 0: INTRODUCE 10% OF MISSING VALUES**

In [None]:
# Introduce 10% missing values (NaN) in each column at randomly chosen rows
data_with_missing = data.copy()

# Set 10% of the values in each column to NaN
missing_percentage = 0.1
for column in data_with_missing.columns:
    missing_indices = np.random.choice(data_with_missing.index, size=int(len(data_with_missing) * missing_percentage), replace=False)
    data_with_missing.loc[missing_indices, column] = np.nan

# Check missing values
print(data_with_missing.isnull().sum())


Feature_1    50
Feature_2    50
Feature_3    50
Feature_4    50
Feature_5    50
Feature_6    50
Target       50
dtype: int64
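
Because `np.random.choice` is called without a seed, a different set of rows is masked on every run, so the MSE values reported later will not reproduce exactly. The following is a minimal sketch of a seeded variant. The helper name `add_missing`, the seed value, and the toy DataFrame are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def add_missing(df, fraction=0.1, seed=42):
    """Return a copy of df with `fraction` of the rows in each column set to NaN."""
    rng = np.random.default_rng(seed)  # seeded generator, so the mask is repeatable
    out = df.copy()
    n_missing = int(len(out) * fraction)
    for column in out.columns:
        # Pick distinct row labels for this column
        rows = rng.choice(out.index, size=n_missing, replace=False)
        out.loc[rows, column] = np.nan
    return out

# Toy data standing in for `data` (hypothetical values)
toy = pd.DataFrame({
    "a": np.arange(20, dtype=float),
    "b": np.arange(20, 40, dtype=float),
})
masked = add_missing(toy, fraction=0.1, seed=0)
print(masked.isnull().sum())
```

Calling `add_missing` twice with the same seed masks the same cells, which makes the later comparison of methods repeatable.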


### Imputation with Mean
*Impute missing values by replacing them with the mean of each feature*

In [None]:
#YOUR CODE HERE
# Impute the missing values by the mean of each column
data_with_missing_mean = data_with_missing.copy()
data_with_missing_mean = data_with_missing_mean.fillna(data_with_missing_mean.mean(numeric_only=True))

# Check missing values
print(data_with_missing_mean.isnull().sum())

Feature_1    0
Feature_2    0
Feature_3    0
Feature_4    0
Feature_5    0
Feature_6    0
Target       0
dtype: int64


### Imputation with Median
*Impute missing values using the median of each feature to reduce the impact of outliers*

In [None]:
#YOUR CODE HERE
# Impute the missing values by the median of each column
data_with_missing_median = data_with_missing.copy()
data_with_missing_median = data_with_missing_median.fillna(data_with_missing_median.median(numeric_only=True))

# Check missing values
print(data_with_missing_median.isnull().sum())

Feature_1    0
Feature_2    0
Feature_3    0
Feature_4    0
Feature_5    0
Feature_6    0
Target       0
dtype: int64
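
To illustrate why the median is less sensitive to outliers than the mean, here is a small sketch on a hypothetical series with one extreme value. The numbers are made up for illustration and do not come from the dataset.

```python
import numpy as np
import pandas as pd

# Four typical values, one outlier (500), and one missing entry
s = pd.Series([10.0, 12.0, 11.0, 13.0, 500.0, np.nan])

mean_fill = s.fillna(s.mean())      # the mean is pulled toward the outlier
median_fill = s.fillna(s.median())  # the median stays near the typical values

print("mean fill:  ", mean_fill.iloc[-1])
print("median fill:", median_fill.iloc[-1])
```

The mean-based fill lands around 109, far from the typical values, while the median-based fill is 12.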


### Imputation with KNN
*Use the K-Nearest Neighbors (KNN) algorithm to impute missing values based on similar observations*

In [None]:
#YOUR CODE HERE
import pandas as pd
from sklearn.impute import KNNImputer
# Impute the missing values using KNN with 5 neighbors
data_with_missing_knn = data_with_missing.copy()
imputer = KNNImputer(n_neighbors=5)
data_with_missing_knn = pd.DataFrame(imputer.fit_transform(data_with_missing_knn), columns=data_with_missing_knn.columns)

print(data_with_missing_knn.isnull().sum())



Feature_1    0
Feature_2    0
Feature_3    0
Feature_4    0
Feature_5    0
Feature_6    0
Target       0
dtype: int64
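
As a rough illustration of what `KNNImputer` does, the sketch below uses a tiny hypothetical array. With its default settings the imputer measures distances with a NaN-aware Euclidean metric and fills a missing entry with the unweighted average of that feature over the nearest rows that have it observed.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: the last row is missing its second feature
X = np.array([
    [1.00, 2.0],
    [1.10, 2.2],
    [5.00, 10.0],
    [1.05, np.nan],
])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

# The two nearest rows (by the first feature) are rows 0 and 1,
# so the filled value is the average of 2.0 and 2.2
print(X_filled[3, 1])
```

The distant row `[5.00, 10.0]` does not influence the result, which is the main difference from mean imputation.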


### Imputation with Linear Regression
*Train a linear regression model on the available data and use it to predict and fill in missing values*

In [None]:
# Impute each column with a linear regression fitted on all the other columns
#YOUR CODE HERE
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# For every column with missing values, train on the rows where it is observed and predict the rows where it is missing
data_with_missing_lr = data_with_missing.copy()

for col in data_with_missing_lr.columns:
    if data_with_missing_lr[col].isnull().sum() > 0:
        # Split the rows into those where this column is observed and those where it is missing
        df_train = data_with_missing_lr[data_with_missing_lr[col].notnull()]
        df_missing = data_with_missing_lr[data_with_missing_lr[col].isnull()]

        X_train = df_train.drop(columns=[col])
        y_train = df_train[col]

        X_missing = df_missing.drop(columns=[col])

        # If the predictor columns also have missing values, fill them temporarily with the training means
        X_train = X_train.fillna(X_train.mean())
        X_missing = X_missing.fillna(X_train.mean())

        model = LinearRegression()
        model.fit(X_train, y_train)

        # Predict the missing values
        predicted = model.predict(X_missing)

        # Write the predicted values back into the DataFrame
        data_with_missing_lr.loc[data_with_missing_lr[col].isnull(), col] = predicted

print(data_with_missing_lr.isnull().sum())

Feature_1    0
Feature_2    0
Feature_3    0
Feature_4    0
Feature_5    0
Feature_6    0
Target       0
dtype: int64
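
The idea behind regression imputation can be seen on a toy example: when a column is an exact linear function of the others, a model fitted on the observed rows recovers the masked values. The column names, coefficients, and masked rows below are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x1": rng.uniform(0, 10, 50),
    "x2": rng.uniform(0, 10, 50),
})
df["y"] = 2 * df["x1"] + 3 * df["x2"] + 1  # exact linear relationship
truth = df["y"].copy()

# Mask a few values of y
df.loc[[3, 7, 11], "y"] = np.nan

# Fit on rows where y is observed, predict rows where it is missing
observed = df["y"].notnull()
model = LinearRegression()
model.fit(df.loc[observed, ["x1", "x2"]], df.loc[observed, "y"])
df.loc[~observed, "y"] = model.predict(df.loc[~observed, ["x1", "x2"]])

print(np.allclose(df["y"], truth))
```

On real data the relationship is noisy and the predictors may themselves be incomplete, so the recovery is only approximate.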


### Mean Squared Error (MSE)
- The MSE is a common metric for evaluating the accuracy of a model or method.
- It is the average squared difference between the actual (true) values and the predicted or imputed values, so larger errors are penalized more heavily.

![MSE](https://miro.medium.com/v2/resize:fit:720/format:webp/0*ox49JmZ2YkKrqG9N.jpg)
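
In case the image does not load, the standard definition can be written out as follows. The symbols are chosen here for illustration: $n$ is the number of evaluated values, $y_i$ is the true value, and $\hat{y}_i$ is the predicted or imputed value.

$$
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
$$

For example, with true values $(3, 5)$ and imputed values $(4, 3)$:

$$
\mathrm{MSE} = \frac{(3 - 4)^2 + (5 - 3)^2}{2} = \frac{1 + 4}{2} = 2.5
$$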

### Compare Imputation Methods using MSE
*Evaluate and compare imputation methods by calculating the Mean Squared Error (MSE) between imputed and true values*

In [None]:
# Compare the imputation methods above using MSE, evaluated on Feature_3 only
import numpy as np
from sklearn.metrics import mean_squared_error

# Boolean mask of the rows where Feature_3 was missing before imputation
missing_indices = data_with_missing['Feature_3'].isnull()

# MSE
mse_mean = mean_squared_error(data.loc[missing_indices, 'Feature_3'], data_with_missing_mean.loc[missing_indices, 'Feature_3'])
mse_median = mean_squared_error(data.loc[missing_indices, 'Feature_3'], data_with_missing_median.loc[missing_indices, 'Feature_3'])
mse_knn = mean_squared_error(data.loc[missing_indices, 'Feature_3'], data_with_missing_knn.loc[missing_indices, 'Feature_3'])
mse_lr = mean_squared_error(data.loc[missing_indices, 'Feature_3'], data_with_missing_lr.loc[missing_indices, 'Feature_3'])

print(f"MSE Mean Imputation: {mse_mean}")
print(f"MSE Median Imputation: {mse_median}")
print(f"MSE KNN Imputation: {mse_knn}")
print(f"MSE Linear Regression Imputation: {mse_lr}")

MSE Mean Imputation: 340.38270856124757
MSE Median Imputation: 343.39387326203496
MSE KNN Imputation: 389.0884623947103
MSE Linear Regression Imputation: 209.53055017532054
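
The comparison above looks at `Feature_3` only. A possible way to extend it to every column is sketched below. The helper name `masked_mse` and the toy DataFrames are assumptions for illustration. In the notebook the arguments would be `data`, `data_with_missing`, and one of the imputed DataFrames.

```python
import numpy as np
import pandas as pd

def masked_mse(original, with_missing, imputed):
    """Per-column MSE, computed only on the cells that were masked."""
    scores = {}
    for column in original.columns:
        mask = with_missing[column].isnull()
        diff = original.loc[mask, column] - imputed.loc[mask, column]
        scores[column] = float((diff ** 2).mean())
    return pd.Series(scores)

# Toy data (hypothetical values)
original = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [10.0, 20.0, 30.0, 40.0],
})
with_missing = original.copy()
with_missing.loc[1, "a"] = np.nan
with_missing.loc[3, "b"] = np.nan

# Mean imputation as the method under evaluation
imputed = with_missing.fillna(with_missing.mean())

scores = masked_mse(original, with_missing, imputed)
print(scores)
```

Restricting the error to masked cells matters: including the untouched cells would add zeros to the average and make every method look better than it is.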
