# Imputation Methods
by Wilmer Garzón, last updated: 30-June-2025

This notebook demonstrates several imputation methods for handling missing data. Starting from a complete sample dataset, we introduce missing values artificially, apply different techniques to fill in the gaps, and then compare the methods using the Mean Squared Error (MSE) between the imputed and the true values.

In [None]:
import pandas as pd
import numpy as np
# KNNImputer fills missing values using the nearest neighbors.
from sklearn.impute import KNNImputer
# LinearRegression builds the linear model used for regression-based imputation.
from sklearn.linear_model import LinearRegression
# mean_squared_error measures how far the imputed values are from the true ones.
from sklearn.metrics import mean_squared_error
# train_test_split divides data into training and test sets.
from sklearn.model_selection import train_test_split

## Dataset Description:
- **Number of samples**: 500
- **Number of features**: 6 numerical features
- **Variables**: Feature_1, Feature_2, ..., Feature_6
- **Target variable**: Target
- **Missing values**: None, the dataset is fully complete

In [None]:
# Load the dataset
data = pd.read_csv('complete_dataset.csv')
data.head()

|   | Feature_1 | Feature_2 | Feature_3 | Feature_4 | Feature_5 | Feature_6 | Target |
|---|-----------|-----------|-----------|-----------|-----------|-----------|--------|
| 0 | 57.450712 | 47.926035 | 59.715328 | 72.845448 | 46.487699 | 46.487946 | 377.845002 |
| 1 | 73.688192 | 61.511521 | 42.957884 | 58.138401 | 43.048735 | 43.014054 | 401.933827 |
| 2 | 53.629434 | 21.300796 | 24.126233 | 41.565687 | 34.807533 | 54.71371 | 266.092906 |
| 3 | 36.379639 | 28.815444 | 71.984732 | 46.613355 | 51.012923 | 28.628777 | 272.331966 |
| 4 | 41.834259 | 51.663839 | 32.735096 | 55.63547 | 40.99042 | 45.624594 | 320.151692 |


In [None]:
# Check for missing values
data.isnull().sum()

Feature_1    0
Feature_2    0
Feature_3    0
Feature_4    0
Feature_5    0
Feature_6    0
Target       0
dtype: int64

## Exercises:
- Complete the following tasks in order.
- Each section below corresponds to a specific imputation technique or evaluation method that you must implement and analyze.

**STEP 0: INTRODUCE 10% OF MISSING VALUES**
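
One way Step 0 could be approached is sketched below. It assumes missing values are introduced only in the six feature columns (the notebook does not say whether `Target` is included), uses a synthetic stand-in for `complete_dataset.csv` so that it runs on its own, and relies on a hypothetical helper named `introduce_missing`. The returned mask records which cells were removed, which is what the MSE comparison needs later.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for complete_dataset.csv (assumption: same column names).
rng = np.random.default_rng(42)
feature_cols = [f"Feature_{i}" for i in range(1, 7)]
data = pd.DataFrame(rng.normal(50, 15, size=(500, 6)), columns=feature_cols)
data["Target"] = data[feature_cols].sum(axis=1)


def introduce_missing(df, columns, fraction=0.10, seed=0):
    """Hypothetical helper: set `fraction` of the cells in `columns` to NaN.

    Returns the modified copy and a boolean DataFrame marking the removed cells.
    """
    rng = np.random.default_rng(seed)
    n_cells = len(df) * len(columns)
    n_missing = int(round(fraction * n_cells))
    # Pick exactly n_missing distinct cells at random.
    flat_mask = np.zeros(n_cells, dtype=bool)
    flat_mask[rng.choice(n_cells, size=n_missing, replace=False)] = True
    mask = pd.DataFrame(
        flat_mask.reshape(len(df), len(columns)), index=df.index, columns=columns
    )
    df_missing = df.copy()
    df_missing[columns] = df[columns].mask(mask)
    return df_missing, mask


data_missing, missing_mask = introduce_missing(data, feature_cols)
print(data_missing.isnull().sum())
```

Keeping the original `data` untouched and working on a copy means the true values stay available for evaluation.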

### Imputation with Mean
*Impute missing values by replacing them with the mean of each feature*

In [None]:
#YOUR CODE HERE
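
A minimal sketch of mean imputation follows, not a reference solution. It builds its own synthetic data with roughly 10% of the feature cells removed (an assumption standing in for the Step 0 output) and fills each column with the mean of its observed values.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data with roughly 10% of the cells removed (assumption).
rng = np.random.default_rng(0)
feature_cols = [f"Feature_{i}" for i in range(1, 7)]
data = pd.DataFrame(rng.normal(50, 15, size=(500, 6)), columns=feature_cols)
data_missing = data.mask(rng.random(data.shape) < 0.10)

# Column means are computed from the observed values only (NaN is skipped).
column_means = data_missing[feature_cols].mean()

# fillna with a Series fills each column with its own mean.
data_mean = data_missing.copy()
data_mean[feature_cols] = data_missing[feature_cols].fillna(column_means)

print(data_mean.isnull().sum())
```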

### Imputation with Median
*Impute missing values using the median of each feature to reduce the impact of outliers*

In [None]:
#YOUR CODE HERE
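
To illustrate why the median is less sensitive to outliers, the toy example below (illustrative data, not taken from the dataset) fills the same missing value once with the mean and once with the median of a column that contains one extreme value.

```python
import numpy as np
import pandas as pd

# Toy column with one extreme outlier and one missing value (illustrative data).
toy = pd.DataFrame({"Feature_1": [1.0, 2.0, 3.0, 4.0, 1000.0, np.nan]})

mean_filled = toy.fillna(toy.mean())
median_filled = toy.fillna(toy.median())

print("mean fill:  ", mean_filled.loc[5, "Feature_1"])    # 202.0
print("median fill:", median_filled.loc[5, "Feature_1"])  # 3.0
```

The same `fillna(df.median())` pattern applies column by column to a full feature table.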

### Imputation with KNN
*Use the K-Nearest Neighbors (KNN) algorithm to impute missing values based on similar observations*

In [None]:
#YOUR CODE HERE
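
A possible sketch using scikit-learn's `KNNImputer` is shown below. The synthetic, correlated data and the choice `n_neighbors=5` (the library default) are assumptions made for illustration; the appropriate number of neighbors for the real dataset is left to the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Synthetic stand-in: correlated features, roughly 10% of cells removed (assumption).
rng = np.random.default_rng(0)
feature_cols = [f"Feature_{i}" for i in range(1, 7)]
shared = rng.normal(50, 10, size=(500, 1))
data = pd.DataFrame(shared + rng.normal(0, 5, size=(500, 6)), columns=feature_cols)
data_missing = data.mask(rng.random(data.shape) < 0.10)

# n_neighbors=5 is the scikit-learn default, not a tuned value.
imputer = KNNImputer(n_neighbors=5)
imputed_array = imputer.fit_transform(data_missing[feature_cols])

# fit_transform returns a NumPy array by default, so rebuild the DataFrame.
data_knn = pd.DataFrame(imputed_array, columns=feature_cols, index=data_missing.index)
print(data_knn.isnull().sum())
```

Because neighbors are found by distance, features on very different scales may need to be standardized first; the features here share a similar range.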

### Imputation with Linear Regression
*Train a linear regression model on the available data and use it to predict and fill in missing values*

In [None]:
# We'll predict Feature_3 using Feature_1 and Feature_2
#YOUR CODE HERE
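
The sketch below shows one way regression imputation could look for `Feature_3`. It makes two simplifying assumptions that differ from the exercise: the data is synthetic, and only `Feature_3` has missing values. If `Feature_1` or `Feature_2` also contain gaps, they would need to be filled first (for example with the column mean) before calling `predict`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in where Feature_3 depends linearly on the predictors (assumption).
rng = np.random.default_rng(0)
n = 500
f1 = rng.normal(50, 15, n)
f2 = rng.normal(50, 15, n)
f3 = 0.5 * f1 + 0.3 * f2 + rng.normal(0, 2, n)
data = pd.DataFrame({"Feature_1": f1, "Feature_2": f2, "Feature_3": f3})

# Remove 10% of the Feature_3 values only (simplification for this sketch).
data_missing = data.copy()
missing_rows = rng.choice(n, size=50, replace=False)
data_missing.loc[missing_rows, "Feature_3"] = np.nan

predictors = ["Feature_1", "Feature_2"]
target = "Feature_3"

# Train on the rows where the target is observed.
observed = data_missing[target].notna()
model = LinearRegression()
model.fit(data_missing.loc[observed, predictors], data_missing.loc[observed, target])

# Predict the target for the rows where it is missing and fill the gaps.
data_reg = data_missing.copy()
data_reg.loc[~observed, target] = model.predict(data_missing.loc[~observed, predictors])

print(data_reg.isnull().sum())
```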

### Mean Squared Error (MSE)
- The MSE is a common metric for evaluating the accuracy of a model or method.
- It is the average squared difference between the actual (true) values and the predicted or imputed values, so larger errors are penalized more heavily than smaller ones.

![MSE](https://miro.medium.com/v2/resize:fit:720/format:webp/0*ox49JmZ2YkKrqG9N.jpg)
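
As a small worked example, the definition $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ can be computed by hand and checked against scikit-learn's `mean_squared_error`. The four values below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Illustrative values, not taken from the dataset.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_imputed = np.array([2.5, 5.0, 4.0, 8.0])

# Errors: 0.5, 0, -2, -1 -> squares: 0.25, 0, 4, 1 -> mean: 5.25 / 4
mse_manual = np.mean((y_true - y_imputed) ** 2)
mse_sklearn = mean_squared_error(y_true, y_imputed)

print(mse_manual)   # 1.3125
print(mse_sklearn)  # 1.3125
```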

### Compare Imputation Methods using MSE
*Evaluate and compare imputation methods by calculating the Mean Squared Error (MSE) between imputed and true values*
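
The comparison could be organized as in the sketch below. It is self-contained and therefore uses synthetic, correlated data rather than the notebook's dataset; `imputation_mse` is a hypothetical helper. The MSE is computed only over the cells that were removed, since all other cells are identical to the true values by construction.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.metrics import mean_squared_error

# Synthetic stand-in: correlated features (assumption, not the notebook's dataset).
rng = np.random.default_rng(0)
feature_cols = [f"Feature_{i}" for i in range(1, 7)]
shared = rng.normal(50, 10, size=(500, 1))
data = pd.DataFrame(shared + rng.normal(0, 5, size=(500, 6)), columns=feature_cols)

# Remove roughly 10% of the cells and remember where they were.
mask = rng.random(data.shape) < 0.10
data_missing = data.mask(mask)

imputed = {
    "mean": data_missing.fillna(data_missing.mean()),
    "median": data_missing.fillna(data_missing.median()),
    "knn": pd.DataFrame(
        KNNImputer(n_neighbors=5).fit_transform(data_missing),
        columns=feature_cols,
        index=data.index,
    ),
}


def imputation_mse(true_df, imputed_df, mask):
    """Hypothetical helper: MSE over the cells that were removed."""
    return mean_squared_error(true_df.to_numpy()[mask], imputed_df.to_numpy()[mask])


results = pd.Series(
    {name: imputation_mse(data, df, mask) for name, df in imputed.items()},
    name="MSE",
).sort_values()
print(results)
```

A lower MSE means the imputed values are closer to the true ones. The ranking obtained on this synthetic data says nothing about the ranking on the real dataset, which depends on how strongly its features are related.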