# Handling Missing Values using KNNImputer

KNN Imputation is a technique for filling in missing values in your dataset using the K-Nearest Neighbors algorithm.

Instead of filling every gap with a single column-wide statistic such as the mean or median, it finds rows (samples) that are similar to the one with the missing value, based on the features that are present in both, and uses their values to fill in the gap.



## How Does it Work?

Let’s say we’re missing a value in a column called BMI for a particular patient.

Here’s what KNN Imputation will do:

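1. Find the K most similar rows (neighbors) that have a known BMI.

2. Similarity is usually measured with Euclidean distance on the numeric features. scikit-learn's `KNNImputer` uses a NaN-aware variant (`nan_euclidean`) that skips coordinates missing in either row.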

3. Average the values of BMI from those K neighbors (or use a weighted average).

4. Fill in the missing BMI value with that average.
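
The four steps above can be traced by hand on a tiny example. The sketch below uses a made-up five-row array (the values and the choice of `k = 2` are assumptions for illustration) and checks that the manual result agrees with `KNNImputer`.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

# Toy data, invented for illustration. Columns: Age, BMI, Cholesterol
X = np.array([
    [30.0, 22.0, 180.0],
    [32.0, 24.0, 190.0],
    [55.0, 31.0, 240.0],
    [58.0, 33.0, 250.0],
    [31.0, np.nan, 183.0],   # patient with a missing BMI
])

k = 2        # number of neighbors
target = 4   # index of the row with the missing value
col = 1      # index of the BMI column

# Steps 1-2: NaN-aware Euclidean distance from the target row to every row
dists = nan_euclidean_distances(X[[target]], X)[0]
dists[target] = np.inf                 # a row cannot be its own neighbor
neighbors = np.argsort(dists)[:k]      # all other rows have a known BMI here

# Steps 3-4: average the neighbors' BMI and use it as the fill value
manual_bmi = X[neighbors, col].mean()

# The same imputation done by scikit-learn
imputed = KNNImputer(n_neighbors=k).fit_transform(X)

print("Neighbors:", neighbors)
print("Manual BMI:", manual_bmi)
print("KNNImputer BMI:", imputed[target, col])
```

For the weighted average mentioned in step 3, `KNNImputer` accepts `weights="distance"`, which gives closer neighbors more influence than distant ones.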


## When to Use KNN Imputation?

**Good When:**

1. You have a moderate amount of missing data

2. Your dataset is not too large (KNN can be slow)

3. Features are correlated and on similar scales (standardize first!)

**Avoid When:**

1. Too much missing data

2. Very high-dimensional data (curse of dimensionality)

3. Features are not correlated or poorly scaled
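
Since the distance is computed across all columns, a column with a wide numeric range can dominate the neighbor search. A minimal sketch of the "standardize first" advice follows. It assumes a small synthetic frame (the column values are invented) and relies on `StandardScaler` ignoring NaNs when fitting and preserving them when transforming.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Synthetic data, invented for illustration
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Age": rng.integers(25, 60, size=50).astype(float),
    "BMI": rng.normal(25, 4, size=50),
    "Cholesterol": rng.normal(200, 30, size=50),
})
df.loc[[3, 7, 11], "BMI"] = np.nan

# Put every column on a comparable scale; NaNs stay NaN
scaler = StandardScaler()
scaled = scaler.fit_transform(df)

# Impute in the scaled space, where no single column dominates the distance
imputed_scaled = KNNImputer(n_neighbors=5).fit_transform(scaled)

# Map the result back to the original units
restored = pd.DataFrame(
    scaler.inverse_transform(imputed_scaled),
    columns=df.columns,
    index=df.index,
)

print(restored.loc[[3, 7, 11], "BMI"])
```

Whether scaling changes the imputed values noticeably depends on the data, so it is worth comparing the result against an unscaled run.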



In [1]:
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

In [2]:
np.random.seed(42)

n = 100
data = pd.DataFrame({
    'Age': np.random.randint(25, 60, size=n),
    'BMI': np.random.normal(25, 4, size=n),
    'Cholesterol': np.random.normal(200, 30, size=n),
    'Gender': np.random.choice(['Male', 'Female'], size=n),
    'Smoker': np.random.choice(['Yes', 'No'], size=n),
    'Risk': np.random.choice(['Low', 'High'], size=n)
})

# Inject missing values into BMI and Cholesterol
data.loc[np.random.choice(n, 10, replace=False), 'BMI'] = np.nan
data.loc[np.random.choice(n, 8, replace=False), 'Cholesterol'] = np.nan

print("🔍 Original Data with Nulls:")
print(data.isnull().sum())

🔍 Original Data with Nulls:
Age             0
BMI            10
Cholesterol     8
Gender          0
Smoker          0
Risk            0
dtype: int64


In [3]:
label_cols = ['Gender', 'Smoker', 'Risk']
label_encoders = {}

for col in label_cols:
    le = LabelEncoder()
    data[col] = le.fit_transform(data[col])
    label_encoders[col] = le
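
Keeping the fitted encoders in `label_encoders` makes it possible to look up which integer stands for which label and to convert the codes back later. The sketch below shows this on a toy series (the values are invented for illustration).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy values, invented for illustration
gender = pd.Series(["Male", "Female", "Female", "Male"])

le = LabelEncoder()
codes = le.fit_transform(gender)

# classes_ holds the original labels in sorted order;
# a label's position in that array is its integer code
mapping = {label: int(code) for code, label in enumerate(le.classes_)}
print(mapping)

# Convert the integer codes back to the original labels
restored = le.inverse_transform(codes)
print(list(restored))
```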

In [4]:
data_knn = data.copy()


In [14]:
data_knn.corr()

| | Age | BMI | Cholesterol | Gender | Smoker | Risk |
|---|---|---|---|---|---|---|
| Age | 1.0 | 0.294181 | 0.034 | 0.070851 | 0.017535 | 0.082658 |
| BMI | 0.294181 | 1.0 | -0.058582 | 0.086521 | 0.092542 | -0.056832 |
| Cholesterol | 0.034 | -0.058582 | 1.0 | -0.082889 | -0.286582 | 0.193503 |
| Gender | 0.070851 | 0.086521 | -0.082889 | 1.0 | 0.047043 | 0.040032 |
| Smoker | 0.017535 | 0.092542 | -0.286582 | 0.047043 | 1.0 | -0.040522 |
| Risk | 0.082658 | -0.056832 | 0.193503 | 0.040032 | -0.040522 | 1.0 |


In [5]:
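# Fill each missing value with the mean of its 5 nearest rows
# (distance is NaN-aware Euclidean by default).
# The columns are used on their raw scales here, so wide-ranged columns
# such as Cholesterol weigh most heavily in the distance.
knn_imputer = KNNImputer(n_neighbors=5)
data_knn_imputed = pd.DataFrame(knn_imputer.fit_transform(data_knn), columns=data.columns)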

In [6]:
data_knn_imputed.isnull().sum()

Age            0
BMI            0
Cholesterol    0
Gender         0
Smoker         0
Risk           0
dtype: int64
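
A zero null count confirms that the gaps were filled but says nothing about the values that went in. One possible sanity check is to look only at the cells that were originally missing. The sketch below does this on a small synthetic frame (`toy` and its values are assumptions for illustration, not the notebook's data).

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Synthetic data, invented for illustration
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "Age": rng.integers(25, 60, size=30).astype(float),
    "BMI": rng.normal(25, 4, size=30),
})
toy.loc[[2, 5, 9], "BMI"] = np.nan

# Remember which cells were missing before imputing
was_missing = toy["BMI"].isna()

toy_imputed = pd.DataFrame(
    KNNImputer(n_neighbors=5).fit_transform(toy),
    columns=toy.columns,
    index=toy.index,
)

# Compare the filled values with the range of the observed ones
filled = toy_imputed.loc[was_missing, "BMI"]
observed = toy.loc[~was_missing, "BMI"]
print(filled)
print("Observed range:", observed.min(), "to", observed.max())
```

With uniform weights each filled value is a mean of observed values, so it always falls inside the observed range.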

# Linear Regression used for Missing Values

We use Linear Regression as a predictive model, but instead of predicting a target label as in a typical ML task, we predict the missing feature itself from the other known features.


## Intuition:

Imagine you're missing someone's Cholesterol level (from the dataset used earlier) but you know:

1. Their Age

2. Their BMI

3. Whether they smoke

4. Their Gender

Now, if you have a lot of other people’s records where all these things are known, you can train a linear regression model:

“Based on Age, BMI, Gender, and Smoker status, what is the expected Cholesterol?”

## The Steps:

**Separate the data:**

1. One part where Cholesterol is known (for training).

2. Another part where it’s missing (for prediction).

**Train Linear Regression**

1. Fit the model on the rows where Cholesterol is known, using the other columns (Age, BMI, etc.) as predictors.

**Predict Missing Values**

1. Use the model to predict Cholesterol for rows where it's missing.

**Fill those missing values**

1. Imputed values go back into the dataset.
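
The steps above can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's own frame: the column values, the roughly linear relationship, and the names `features`, `known` and `unknown` are all assumptions. It also assumes the predictor columns are complete, because `LinearRegression` does not accept NaNs in its inputs. In the dataset used earlier, BMI would need to be imputed or dropped from the predictors first.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data with a roughly linear relationship, invented for illustration
rng = np.random.default_rng(42)
n = 100
df = pd.DataFrame({
    "Age": rng.integers(25, 60, size=n).astype(float),
    "BMI": rng.normal(25, 4, size=n),
    "Gender": rng.integers(0, 2, size=n).astype(float),
    "Smoker": rng.integers(0, 2, size=n).astype(float),
})
df["Cholesterol"] = (
    120 + 1.0 * df["Age"] + 1.5 * df["BMI"] + 10 * df["Smoker"]
    + rng.normal(0, 3, size=n)
)
df.loc[rng.choice(n, 8, replace=False), "Cholesterol"] = np.nan

features = ["Age", "BMI", "Gender", "Smoker"]
target = "Cholesterol"

# Separate the data: rows with a known target vs. rows where it is missing
known = df[df[target].notna()]
unknown = df[df[target].isna()]

# Train on the complete rows
model = LinearRegression().fit(known[features], known[target])

# Predict the missing values and write them back into a copy
df_lr = df.copy()
df_lr.loc[unknown.index, target] = model.predict(unknown[features])

print(df_lr[target].isna().sum())
```

Writing into a copy keeps the original frame available for comparing the imputed values with other methods.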

## When is this effective?

1. The feature you’re imputing is correlated with the others

2. You have enough complete rows to train a decent model

3. The relationships are more or less linear
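
One rough way to check these conditions before trusting the imputed values is to cross-validate the regression on the rows where the feature is known. The sketch below assumes synthetic data and uses R² as the score, which is one possible choice among several.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic complete rows, invented for illustration
rng = np.random.default_rng(0)
n = 100
known = pd.DataFrame({
    "Age": rng.integers(25, 60, size=n).astype(float),
    "BMI": rng.normal(25, 4, size=n),
    "Smoker": rng.integers(0, 2, size=n).astype(float),
})
known["Cholesterol"] = (
    120 + 1.0 * known["Age"] + 1.5 * known["BMI"] + 10 * known["Smoker"]
    + rng.normal(0, 3, size=n)
)

features = ["Age", "BMI", "Smoker"]

# Condition 1: how strongly each predictor correlates with the target
print(known.corr()["Cholesterol"].drop("Cholesterol"))

# Conditions 2 and 3: enough rows, and a linear fit that generalizes
scores = cross_val_score(
    LinearRegression(), known[features], known["Cholesterol"],
    cv=5, scoring="r2",
)
print("Mean R^2 across folds:", scores.mean())
```

A mean R² near zero would suggest that the regression adds little over filling with the column mean.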