# K-Nearest Neighbours (KNN)  
[source](https://www.youtube.com/watch?v=-fK-xEev2I8&list=PLKnIA16_Rmvbr7zKYQuBfsVkjoLcJgxHH&index=39)  
<br><br>
![ytss](assets/KNN_intro.png)
<br><br>

The nan-Euclidean distance formula is almost the same as the *Euclidean distance* formula; the only difference is a new term called the weight, which multiplies the sum of squared differences inside the square root. It is calculated as `total number of features / number of features that are non-null in both observations being compared`. In the example shown in the picture this is 3/3 = 1, so there the formula is exactly the Euclidean distance formula.
The example above makes the calculation easy to follow.
See the following examples in the screenshot as well:  <br><br>
![ytss](assets/KNN_example.png) <br><br>
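
The weighted distance described above can be sketched in a few lines. This is a minimal illustration, not scikit-learn's implementation: the helper `nan_euclidean` and the two sample rows are made up for the example, and the result is compared against scikit-learn's `nan_euclidean_distances`.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances


def nan_euclidean(a, b):
    """Euclidean distance over the features present in both rows,
    scaled by weight = total features / present features."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    present = ~np.isnan(a) & ~np.isnan(b)  # features non-null in both rows
    if not present.any():
        return np.nan  # no shared feature, so the distance is undefined
    weight = a.size / present.sum()
    return np.sqrt(weight * np.sum((a[present] - b[present]) ** 2))


# Hypothetical rows: the second feature is missing in row_a
row_a = [3.0, np.nan, 5.0]
row_b = [1.0, 0.0, 0.0]

manual = nan_euclidean(row_a, row_b)  # sqrt(3/2 * ((3-1)^2 + (5-0)^2))
library = nan_euclidean_distances([row_a], [row_b])[0, 0]
print(manual, library)
```

Here two of the three features are usable, so the weight is 3/2; with no missing values the weight would be 1 and the result would be the ordinary Euclidean distance.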

After computing the nan-Euclidean distances, pick the k closest observations. Then take the values of the feature being imputed from those k neighbours, compute their mean, and replace the null value with it.  
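
The step above can be tried on a tiny array. This is only a sketch: the 4x3 array is toy data rather than the Titanic set, and `n_neighbors=2` is chosen just to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data: row 0 is missing feature 2, row 2 is missing feature 0
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

imputer = KNNImputer(n_neighbors=2)  # weights='uniform' by default
X_filled = imputer.fit_transform(X)
print(X_filled)

# Row 0: the two nearest rows that have feature 2 are rows 1 and 2,
# so the gap is filled with (3 + 5) / 2 = 4.
# Row 2: the two nearest rows that have feature 0 are rows 1 and 3,
# so the gap is filled with (3 + 8) / 2 = 5.5.
```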
<br>

## Advantages & Disadvantages
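* Advantage: usually more accurate than a simple mean imputation, because the other features are used to find similar rows
* Disadvantage: more calculations, since distances to the training rows have to be computed
* Disadvantage: the whole training set has to be uploaded to the server in the production environment (slower speed, more memory usage)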

## Code


In [35]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer, KNNImputer

from sklearn.metrics import accuracy_score

In [36]:
df = pd.read_csv('assets/Titanic-Dataset.csv',usecols=['Age','Fare','Pclass','Survived'])
df.head()

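| | Survived | Pclass | Age | Fare |
|---|----------|--------|-----|------|
| 0 | 0 | 3 | 22.0 | 7.25 |
| 1 | 1 | 1 | 38.0 | 71.2833 |
| 2 | 1 | 3 | 26.0 | 7.925 |
| 3 | 1 | 1 | 35.0 | 53.1 |
| 4 | 0 | 3 | 35.0 | 8.05 |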


In [37]:
df.isnull().mean()*100

Survived     0.00000
Pclass       0.00000
Age         19.86532
Fare         0.00000
dtype: float64

In [38]:
X = df.drop('Survived',axis=1)
y = df['Survived']

In [39]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=2)

## KNNImputer `weights` parameter
```python
weights = ['uniform', 'distance']  # accepted string values; 'uniform' is the default
```

## Example:

We want to fill in a missing value (say, **Age**) using **KNNImputer**.  
We found our 3 closest neighbors and their distances:

| Neighbor | Distance from missing row | Age (value to use) |
|-----------|---------------------------|--------------------|
| 1 | 1.0 | 10 |
| 2 | 2.0 | 12 |
| 3 | 4.0 | 9 |

---

###  1. Uniform weighting (`weights='uniform'`)

All neighbors are treated equally — it doesn’t matter if they are close or far.  
We just take the mean of their “Age” values:

(10 + 12 + 9) / 3 = 31 / 3 = 10.33



 **Imputed Age = 10.33**

---

###  2. Distance weighting (`weights='distance'`)

Here, closer neighbors get **more importance** (more “weight”) than farther ones.

**Step 1: Inverse of distances**

| Neighbor | Distance | 1 / Distance |
|-----------|-----------|--------------|
| 1 | 1.0 | 1.00 |
| 2 | 2.0 | 0.50 |
| 3 | 4.0 | 0.25 |

**Step 2: Add them up**

Total = 1.00 + 0.50 + 0.25 = 1.75


**Step 3: Convert to proportions (weights)**

| Neighbor | 1/Distance | Weight (fraction of total) |
|-----------|-------------|----------------------------|
| 1 | 1.00 | 1.00 / 1.75 = 0.571 |
| 2 | 0.50 | 0.50 / 1.75 = 0.286 |
| 3 | 0.25 | 0.25 / 1.75 = 0.143 |

**Step 4: Multiply each value by its weight**

| Neighbor | Age | Weight | Contribution |
|-----------|------|---------|--------------|
| 1 | 10 | 0.571 | 5.71 |
| 2 | 12 | 0.286 | 3.43 |
| 3 | 9 | 0.143 | 1.29 |

**Step 5: Add contributions**    

Total = 5.71 + 3.43 + 1.29 = 10.43

**Imputed Age = 10.43**
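
The five steps above can be reproduced with NumPy. This is a minimal sketch using the three neighbours from the table; the variable names are illustrative only.

```python
import numpy as np

# Distances and Age values of the three neighbours from the table
distances = np.array([1.0, 2.0, 4.0])
ages = np.array([10.0, 12.0, 9.0])

inverse = 1.0 / distances          # Step 1: inverse of distances
weights = inverse / inverse.sum()  # Steps 2-3: normalise so weights sum to 1
weighted_age = np.sum(weights * ages)  # Steps 4-5: weighted sum

uniform_age = ages.mean()  # uniform weighting, for comparison

print(f"uniform: {uniform_age:.2f}, distance: {weighted_age:.2f}")
# prints "uniform: 10.33, distance: 10.43"
```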

---

### Compare results

| Method | Result | Explanation |
|---------|---------|-------------|
| **Uniform** | 10.33 | Simple average — all neighbors counted equally |
| **Distance** | 10.43 | Closer neighbors have slightly more influence |

---

### Key takeaway

- **Uniform weighting** → equal importance → simple average  
- **Distance weighting** → close neighbors have **more say** → weighted average  
- The difference is small here, but when distances vary a lot, distance weighting pulls the imputed value further towards the closest neighbours.
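
The same two results can be checked with `KNNImputer` itself. This is a sketch on made-up data: the first column is a hypothetical helper feature placed 1, 2 and 4 units away from the incomplete row. Because the Age value is missing in that row, the nan-Euclidean distances come out as 1, 2 and 4 times √2 rather than exactly 1, 2 and 4, but the inverse-distance weights keep the same proportions, so the imputed values match the worked example.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Column 0: hypothetical helper feature, column 1: Age
X = np.array([
    [0.0, np.nan],  # row to impute
    [1.0, 10.0],    # neighbour 1 (closest)
    [2.0, 12.0],    # neighbour 2
    [4.0, 9.0],     # neighbour 3 (farthest)
])

results = {}
for w in ("uniform", "distance"):
    imputer = KNNImputer(n_neighbors=3, weights=w)
    results[w] = imputer.fit_transform(X)[0, 1]  # imputed Age of row 0

for name, value in results.items():
    print(f"{name}: {value:.2f}")
```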




In [68]:
knn = KNNImputer(n_neighbors=2, weights='uniform')  # the accuracy is the same with either `weights` setting
X_train_trf = knn.fit_transform(X_train)
X_test_trf = knn.transform(X_test)

In [69]:
pd.DataFrame(X_train_trf,columns=X_train.columns).isnull().mean()*100 #nulls gone

Pclass    0.0
Age       0.0
Fare      0.0
dtype: float64

In [70]:
lr = LogisticRegression()
lr.fit(X_train_trf,y_train)

y_pred = lr.predict(X_test_trf)

accuracy_score(y_test,y_pred)

0.7094972067039106

In [71]:
# comparing with mean imputation

In [72]:
si = SimpleImputer()  # default strategy='mean'

X_train_trf2 = si.fit_transform(X_train)
X_test_trf2 = si.transform(X_test)

In [73]:
lr = LogisticRegression()
lr.fit(X_train_trf2,y_train)

y_pred2 = lr.predict(X_test_trf2)

accuracy_score(y_test,y_pred2)

0.6927374301675978

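The difference is clear: on this split, KNN imputation gives an accuracy of about 0.709, against about 0.693 for mean imputation.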