<a href="https://colab.research.google.com/github/Smarth2005/Machine-Learning/blob/main/Exploratory%20Data%20Analysis/06.%20KNN%20Imputer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **KNN Imputer: A Distance-Based Approach to Fill NaNs**

<div align="justify">

The KNN Imputer (K-Nearest Neighbors Imputer) fills missing values by finding the K most similar instances (rows) based on the available features and using their values to impute the missing ones.

#### <u> KNN IMPUTER ALGORITHM </u>:-

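**Step-1** Choose a missing value to fill.<br>
**Step-2** Take the other (observed) values in that row.<br>
**Step-3** Choose the number of neighbors (`n_neighbors`).<br>
**Step-4** Calculate the `nan_euclidean` distance from that row to every other row, using only the features present in both rows.<br>
**Step-5** Take the `n_neighbors` closest rows that have a value in the target column and average those values to fill the gap.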

### 📐 `nan_euclidean` Distance Formula

Given two samples:  
**x = [x₁, x₂, ..., x_d]**  
**y = [y₁, y₂, ..., y_d]**

Let:
- **V** &nbsp;  = indices where both `xᵢ` and `yᵢ` are **not missing**
- **|V|** = number of such valid dimensions
- **d**  &nbsp; = total number of features (columns)

The distance is calculated as:

$$
\text{nan_euclidean}(x, y) = \sqrt{ \frac{ \sum\limits_{i \in V} (x_i - y_i)^2 }{|V|} \cdot d }
$$

Or more simply:

$$
\text{distance} = \sqrt{ \left( \frac{\text{sum of squared differences on valid features}}{\text{number of valid features}} \right) \times \text{total features} }
$$
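
As a minimal sketch, the formula above can be written directly in NumPy. The function name `nan_euclidean` and the two sample vectors are illustrative assumptions, not part of the original notebook; returning NaN when the two samples share no observed feature is also a choice made for this sketch.

```python
import numpy as np

def nan_euclidean(x, y):
    """Distance between two 1-D samples, skipping coordinates where either is NaN."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    valid = ~np.isnan(x) & ~np.isnan(y)      # the index set V
    if not valid.any():
        return np.nan                        # no shared features: distance undefined
    sq_sum = np.sum((x[valid] - y[valid]) ** 2)
    # Average squared difference over |V|, rescaled by the total feature count d
    return np.sqrt(sq_sum / valid.sum() * x.size)

# Two hypothetical samples with d = 3 features; only feature 0 is present in both
x = [3.0, np.nan, 5.0]
y = [1.0, 0.0, np.nan]
dist = nan_euclidean(x, y)   # sqrt((3 - 1)^2 / 1 * 3) = sqrt(12)
print(round(dist, 4))
```

The rescaling by `d / |V|` keeps distances comparable between pairs of rows that share different numbers of observed features.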

Below is a sample dataset with missing values (denoted by `--`) used to demonstrate how the KNN Imputer fills in missing entries based on similar rows:

|     | Friends | HIMYM | GOT | Suits | Breaking Bad |
|-----|---------|-------|-----|--------|---------------|
| 0   | 80      | 30    | 7   | 14     | 27            |
| 1   | 44      | --    | 10  | 0      | 29            |
| 2   | --      | 85    | 25  | 5      | 88            |
| 3   | 50      | 70    | 74  | 9      | 49            |
| 4   | 29      | 54    | 49  | 20     | --            |


</div>

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

In [None]:
data = {
    'Friends': [80, 44, np.nan, 50, 29],
    'HIMYM': [30, np.nan, 85, 70, 54],
    'GOT': [7, 10, 25, 74, 49],
    'Suits': [14, 0, 5, 9, 20],
    'Breaking Bad': [27, 29, 88, 49, np.nan]
}

df = pd.DataFrame(data)
df

Unnamed: 0,Friends,HIMYM,GOT,Suits,Breaking Bad
0,80.0,30.0,7,14,27.0
1,44.0,,10,0,29.0
2,,85.0,25,5,88.0
3,50.0,70.0,74,9,49.0
4,29.0,54.0,49,20,


#### Mathematical Calculations (for simplicity, we choose `n_neighbors = 2`)
🎯 Impute **row 2, column "Friends"** using KNN Imputer with k = 2
<table>
  <tr><th>Row</th><th>Friends</th><th>HIMYM</th><th>GOT</th><th>Suits</th><th>Breaking Bad</th></tr>
  <tr><td>0</td><td>80</td><td>30</td><td>7</td><td>14</td><td>27</td></tr>
  <tr><td>1</td><td>44</td><td>NaN</td><td>10</td><td>0</td><td>29</td></tr>
  <tr><td>2</td><td style="background-color: #ffff99;"><b>NaN</b></td><td>85</td><td>25</td><td>5</td><td>88</td></tr>
  <tr><td>3</td><td>50</td><td>70</td><td>74</td><td>9</td><td>49</td></tr>
  <tr><td>4</td><td>29</td><td>54</td><td>49</td><td>20</td><td>NaN</td></tr>
</table>

**Step 1: Compute `nan_euclidean` distances from row 2 to others**  
Using formula:  
$$
\text{distance}(x, y) = \sqrt{ \left( \frac{\sum_{i \in V} (x_i - y_i)^2}{|V|} \right) \times d }
$$  
Where:  
- $V$ &nbsp;&nbsp;: features that are non-missing in **both** rows  
- $|V|$: number of valid features  
- $d$: total features, here $d=5$

**Distance to Row 4:**

Common features: HIMYM, GOT, Suits  
|V| = 3

- (85 − 54)² = 961  
- (25 − 49)² = 576  
- (5 − 20)² &nbsp;&nbsp;= 225  
Total = 1762

$$
\text{distance}_{2,4} = \sqrt{ \left( \frac{1762}{3} \right) \times 5 }
= \sqrt{587.33 \times 5}
= \sqrt{2936.67}
\approx 54.19
$$

**Distance to Row 3:**

Common features: HIMYM, GOT, Suits, Breaking Bad  
|V| = 4

- (85 − 70)² = 225  
- (25 − 74)² = 2401  
- (5 − 9)² = 16  
- (88 − 49)² = 1521  
Total = 4163

$$
\text{distance}_{2,3} = \sqrt{ \left( \frac{4163}{4} \right) \times 5 }
= \sqrt{1040.75 \times 5}
= \sqrt{5203.75}
\approx 72.14
$$

**Distance to Row 1:**

Common features: GOT, Suits, Breaking Bad  
|V| = 3

- (25 − 10)² = 225  
- (5 − 0)² = 25  
- (88 − 29)² = 3481  
Total = 3731

$$
\text{distance}_{2,1} = \sqrt{ \left( \frac{3731}{3} \right) \times 5 }
= \sqrt{1243.67 \times 5}
= \sqrt{6218.33}
\approx 78.86
$$

**Distance to Row 0:**

Common features: HIMYM, GOT, Suits, Breaking Bad  
|V| = 4

- (85 − 30)² = 3025  
- (25 − 7)² = 324  
- (5 − 14)² = 81  
- (88 − 27)² = 3721  
Total = 7151

$$
\text{distance}_{2,0} = \sqrt{ \left( \frac{7151}{4} \right) \times 5 }
= \sqrt{1787.75 \times 5}
= \sqrt{8938.75}
\approx 94.54
$$
<br>

| Neighbor Row | Distance | Friends |
|--------------|----------|---------|
| Row 0        | 94.54    | 80      |
| Row 1        | 78.86    | 44      |
| Row 3        | 72.14    | 50      |
| Row 4        | 54.19    | 29      |


🧮 **Step 2: Impute the missing value using mean of neighbors**

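We rank the rows that have a `Friends` value by their `nan_euclidean` distance to row 2 and keep the closest ones. With `n_neighbors = 2`, the two closest rows are row 4 and row 3, so we average their `Friends` values (29 and 50):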
$$
\text{Imputed Friends} = \frac{29 + 50}{2} = \frac{79}{2} = 39.5
$$

**Final Answer**: **Imputed Value for Row 2 → `Friends = 39.5`**
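
The hand calculation can be cross-checked with a short script. This is a sketch under stated assumptions: it uses scikit-learn's `nan_euclidean_distances` helper for the distances, while the donor selection and averaging are written out by hand for illustration and are not the library's internal code. The variable names (`target_row`, `donors`, `nearest`) are hypothetical.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

# Same toy ratings as in the table above (NaN marks a missing entry)
X = np.array([
    [80, 30, 7, 14, 27],
    [44, np.nan, 10, 0, 29],
    [np.nan, 85, 25, 5, 88],
    [50, 70, 74, 9, 49],
    [29, 54, 49, 20, np.nan],
], dtype=float)

target_row, target_col, k = 2, 0, 2   # row 2, "Friends", two neighbors

# Step 1: distances from row 2 to every row
dists = nan_euclidean_distances(X[[target_row]], X)[0]

# Candidate donors: every other row that has a value in the target column
donors = [i for i in range(len(X))
          if i != target_row and not np.isnan(X[i, target_col])]

# Step 2: keep the k closest donors and average their "Friends" values
nearest = sorted(donors, key=lambda i: dists[i])[:k]
imputed = X[nearest, target_col].mean()

print(nearest, imputed)   # [4, 3] 39.5
```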

In [None]:
# Apply KNN Imputer
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=2)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
df_imputed

Unnamed: 0,Friends,HIMYM,GOT,Suits,Breaking Bad
0,80.0,30.0,7.0,14.0,27.0
1,44.0,42.0,10.0,0.0,29.0
2,39.5,85.0,25.0,5.0,88.0
3,50.0,70.0,74.0,9.0,49.0
4,29.0,54.0,49.0,20.0,68.5


We can observe that the imputed value in the `Friends` column of row 2 (by index) matches our hand-computed value of 39.5, which confirms how the algorithm works.

Now we apply `KNNImputer` to the `income_evaluation` dataset that we have been working on.

In [None]:
from google.colab import files
uploaded = files.upload()

Saving income_evaluation.csv to income_evaluation.csv


In [None]:
df = pd.read_csv('income_evaluation.csv', na_values=' ?')
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [None]:
# Checking missing values
df.isnull().sum()

Unnamed: 0,0
age,0
workclass,1836
fnlwgt,0
education,0
education-num,0
marital-status,0
occupation,1843
relationship,0
race,0
sex,0


For exploration and learning purposes, we also intentionally inject NaN values into some numerical features to simulate missing data.

In [None]:
# Inject 20 missing values into hours-per-week at randomly chosen rows
np.random.seed(seed=0)
h = np.random.choice(a=df.index, replace=False, size=20)
df.loc[h, ' hours-per-week'] = np.nan

In [None]:
# Inject 28 missing values into age at randomly chosen rows
np.random.seed(seed=10)
a = np.random.choice(a=df.index, replace=False, size=28)
df.loc[a, 'age'] = np.nan

In [None]:
df.isna().sum()

Unnamed: 0,0
age,28
workclass,1836
fnlwgt,0
education,0
education-num,0
marital-status,0
occupation,1843
relationship,0
race,0
sex,0


In [None]:
# separate independent and dependent features
X = df.drop(' income', axis=1)
y = df[' income']

# train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)

In [None]:
x_train.isnull().sum()

Unnamed: 0,0
age,23
workclass,1468
fnlwgt,0
education,0
education-num,0
marital-status,0
occupation,1475
relationship,0
race,0
sex,0


In [None]:
from sklearn.impute import KNNImputer
knn = KNNImputer(n_neighbors=5) # default value is 5

In [None]:
# This fails: KNNImputer needs numeric input, but x_train still holds string columns
knn.fit_transform(x_train)

ValueError: could not convert string to float: ' Private'

In [None]:
# Keep only the numeric (non-object) columns
num = [col for col in x_train.columns if x_train[col].dtypes != 'O']

In [None]:
knn.fit_transform(x_train[num])

array([[5.50000e+01, 2.38216e+05, 9.00000e+00, 0.00000e+00, 0.00000e+00,
        4.00000e+01],
       [2.40000e+01, 3.06460e+05, 9.00000e+00, 0.00000e+00, 0.00000e+00,
        4.00000e+01],
       [4.80000e+01, 2.13140e+05, 4.00000e+00, 0.00000e+00, 0.00000e+00,
        4.00000e+01],
       ...,
       [8.50000e+01, 1.66027e+05, 9.00000e+00, 0.00000e+00, 0.00000e+00,
        5.00000e+01],
       [3.60000e+01, 4.69056e+05, 9.00000e+00, 0.00000e+00, 0.00000e+00,
        2.50000e+01],
       [2.60000e+01, 1.98163e+05, 1.40000e+01, 0.00000e+00, 0.00000e+00,
        4.00000e+01]])

In [None]:
pd.DataFrame(knn.fit_transform(x_train[num]))

Unnamed: 0,0,1,2,3,4,5
0,55.0,238216.0,9.0,0.0,0.0,40.0
1,24.0,306460.0,9.0,0.0,0.0,40.0
2,48.0,213140.0,4.0,0.0,0.0,40.0
3,36.0,127306.0,13.0,0.0,0.0,40.0
4,53.0,103586.0,13.0,0.0,0.0,55.0
...,...,...,...,...,...,...
26043,54.0,220115.0,10.0,0.0,0.0,30.0
26044,42.0,211517.0,7.0,0.0,0.0,40.0
26045,85.0,166027.0,9.0,0.0,0.0,50.0
26046,36.0,469056.0,9.0,0.0,0.0,25.0


In [None]:
pd.DataFrame(knn.fit_transform(x_train[num])).isna().sum().sum()

np.int64(0)

In [None]:
x_test[num].isna().sum()

Unnamed: 0,0
age,5
fnlwgt,0
education-num,0
capital-gain,0
capital-loss,0
hours-per-week,1


In [None]:
knn.transform(x_test[num])

array([[3.20000e+01, 2.60954e+05, 7.00000e+00, 0.00000e+00, 2.04200e+03,
        3.00000e+01],
       [3.10000e+01, 2.36391e+05, 1.00000e+01, 0.00000e+00, 0.00000e+00,
        4.00000e+01],
       [5.90000e+01, 1.75689e+05, 1.00000e+01, 0.00000e+00, 0.00000e+00,
        1.40000e+01],
       ...,
       [2.60000e+01, 1.77482e+05, 1.20000e+01, 0.00000e+00, 0.00000e+00,
        4.50000e+01],
       [4.70000e+01, 2.58498e+05, 1.00000e+01, 0.00000e+00, 0.00000e+00,
        5.20000e+01],
       [4.50000e+01, 1.60962e+05, 1.00000e+01, 0.00000e+00, 0.00000e+00,
        3.50000e+01]])

In [None]:
pd.DataFrame(knn.transform(x_test[num]))

Unnamed: 0,0,1,2,3,4,5
0,32.0,260954.0,7.0,0.0,2042.0,30.0
1,31.0,236391.0,10.0,0.0,0.0,40.0
2,59.0,175689.0,10.0,0.0,0.0,14.0
3,37.0,114765.0,10.0,0.0,0.0,40.0
4,40.0,179717.0,13.0,0.0,1564.0,60.0
...,...,...,...,...,...,...
6508,48.0,125892.0,5.0,0.0,0.0,40.0
6509,37.0,186934.0,7.0,3103.0,0.0,44.0
6510,26.0,177482.0,12.0,0.0,0.0,45.0
6511,47.0,258498.0,10.0,0.0,0.0,52.0


In [None]:
pd.DataFrame(knn.transform(x_test[num])).isna().sum().sum()

np.int64(0)
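
Wrapping the imputer output in a bare `pd.DataFrame(...)`, as above, drops the column names and the original row index. A minimal sketch of one way to keep them follows; the small `train` and `test` frames are hypothetical stand-ins for `x_train[num]` and `x_test[num]`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical stand-ins for x_train[num] and x_test[num]
train = pd.DataFrame(
    {"age": [25.0, 32.0, 40.0, 51.0], "hours": [40.0, 45.0, 38.0, 50.0]},
    index=[10, 11, 12, 13],
)
test = pd.DataFrame(
    {"age": [np.nan, 47.0], "hours": [42.0, np.nan]},
    index=[20, 21],
)

knn = KNNImputer(n_neighbors=2).fit(train)

# Rebuild a DataFrame that keeps the original column labels and row index
test_filled = pd.DataFrame(
    knn.transform(test), columns=test.columns, index=test.index
)
print(test_filled)
```

Keeping the index matters when the imputed numeric columns are later joined back to the categorical columns of the same rows.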

<div align="justify">

**Note:** When using `KNNImputer`, the `.transform()` method imputes missing values in the test set by finding the nearest neighbors from the training set (i.e., it does not compute neighbors within the test set itself).
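
The note above can be illustrated with a small sketch. The arrays are hypothetical; both test rows miss the same column, so neither could act as a donor for the other, and the filled values can only come from the training rows.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical training data: column 1 is 10x column 0
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])

# Hypothetical test data: both rows miss column 1
X_test = np.array([[1.1, np.nan], [3.9, np.nan]])

imputer = KNNImputer(n_neighbors=1).fit(X_train)
X_test_filled = imputer.transform(X_test)

# Each test row takes column 1 from its single nearest training row
print(X_test_filled[:, 1])   # [10. 40.]
```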

#### **Key Differences Between `SimpleImputer` and `KNNImputer` :-**

`SimpleImputer` replaces missing values using basic strategies like mean, median, mode, or a constant value, working **independently on each column**. It is very fast and best suited when data is missing completely at random (MCAR).

In contrast, `KNNImputer` uses the values of the k nearest neighbors, determined by feature similarity, to fill in missing values. Because it takes multiple features into account, it is more context-aware and tends to be more accurate when **features are correlated or missingness has a pattern**. While `KNNImputer` is more powerful, it is also computationally heavier, since it has to compute distances between rows.
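
To make the contrast concrete, here is a small sketch on hypothetical data in which the second column is exactly twice the first. The array and variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Hypothetical data with two strongly correlated columns; one value is missing
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [9.0, np.nan],
    [10.0, 20.0],
    [11.0, 22.0],
])

mean_filled = SimpleImputer(strategy="mean").fit_transform(X)
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)

print(mean_filled[3, 1])   # column mean of the observed values, ignores column 0
print(knn_filled[3, 1])    # mean of the two rows closest on column 0
```

Here the column mean is 10.8, while the two nearest rows (first-column values 10 and 11) give 21.0, which is much closer to the value the pattern in the data suggests.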

> It is recommended to check feature correlations (e.g., using a heatmap) before applying KNNImputer, as it works best when features are correlated and can meaningfully contribute to distance calculations.
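
A minimal sketch of such a correlation check follows. The `demo` frame is synthetic and purely illustrative; in this notebook the same call would be made on the numeric training columns.

```python
import numpy as np
import pandas as pd

# Synthetic numeric frame, used only to illustrate the check
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),   # strongly tied to "a"
    "c": rng.normal(size=200),                       # unrelated noise
})
demo.loc[::10, "b"] = np.nan   # corr() skips missing values pairwise

corr = demo.corr()
print(corr.round(2))
```

The resulting matrix can be passed to a plotting call such as `seaborn.heatmap(corr, annot=True)` to get the heatmap mentioned above. In this synthetic case, `a` would be a useful neighbor signal for imputing `b`, while `c` would mostly add noise to the distances.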

</div>