## K Nearest Neighbour Algorithm

- KNN is a supervised learning algorithm.
- A KNN classifier assigns each example the class of the most similar labelled examples.
- It is simple but often surprisingly powerful.
- It is well suited to problems where the relationships between the samples are very difficult to describe explicitly, yet examples of the same class tend to be similar to each other.

### How does it work?
- Given a new example, the KNN algorithm first identifies the k examples in the training dataset that are nearest to it in similarity.
- The unlabelled test example is then assigned the majority class among those k nearest neighbours.

- KNN does not build a model during training. It simply stores the training data.
- When a new data point arrives, KNN (in its basic, brute-force form) **computes distances** to all stored data points to find the nearest neighbours. All the "work" happens during prediction, so training is extremely fast but prediction is slow, because the distance computation is expensive.
- KNN is a non-parametric method: it does not assume any particular form for the underlying data distribution.
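
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not how scikit-learn implements it; the function name `knn_predict` and the toy data are made up for this example, and Euclidean distance is assumed.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # "Training" is just keeping X_train and y_train around; all work happens here.
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))  # Euclidean distance
    nearest = np.argsort(distances)[:k]         # indices of the k closest points
    votes = Counter(y_train[nearest].tolist())  # count the labels among them
    return votes.most_common(1)[0][0]


# Toy data: two small clusters labelled 0 and 1 (hypothetical values).
X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                  [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

label_a = knn_predict(X_toy, y_toy, np.array([1.2, 1.4]), k=3)
label_b = knn_predict(X_toy, y_toy, np.array([6.4, 6.2]), k=3)
print(label_a, label_b)
```

Each call scans every stored point, which is why prediction cost grows with the size of the training set.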



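### Distance

For two points p = (x1, y1) and q = (x2, y2):

Manhattan:
`dist(p, q) = |x1 - x2| + |y1 - y2|`

Euclidean:
`dist(p, q) = sqrt((x1 - x2)^2 + (y1 - y2)^2)`

Manhattan distance tends to hold up better than Euclidean distance as the number of dimensions grows.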
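
As a quick numeric check of the two formulas, here is a small sketch for two hypothetical points, (1, 2) and (4, 6). The variable names are chosen for this example only.

```python
import numpy as np

# Two hypothetical 2-D points.
p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])

manhattan = np.abs(p - q).sum()            # |1-4| + |2-6| = 3 + 4 = 7
euclidean = np.sqrt(((p - q) ** 2).sum())  # sqrt(3^2 + 4^2) = sqrt(25) = 5
print(manhattan, euclidean)
```

The same expressions work unchanged for points with any number of coordinates.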

## Bias Variance Tradeoff

Bias is the error that comes from incorrect assumptions made by the algorithm. High bias means underfitting: the model is too simple and misses the trends in the training data.

Variance is the error that comes from sensitivity to random fluctuations in the dataset. High variance means overfitting: the model learns the noise instead of the actual relationships between the variables in the data, and because that noise does not recur in the test set, performance on new data suffers.
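
One common way to make the tradeoff precise is the decomposition of the expected prediction error. It is stated here under assumptions that are not in the notes above: squared-error loss, and data generated as $y = f(x) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, where $\hat{f}$ is the fitted model and the expectation is over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

For classification the decomposition is less clean, but the same intuition applies: lowering one term usually raises the other.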


## How to choose the optimal k value?


Small k (e.g., k=1)
- Model is **very sensitive** to noise.
- **Low bias, high variance**.
- **Overfitting** risk: captures noise as patterns.

Large k (e.g., k ≈ total data points)
- Model becomes **too smooth**.
- **High bias, low variance**.
- **Underfitting**: misses important patterns.

Choosing the right k
- Use **cross-validation** to select the optimal `k`.
- Odd `k` values help avoid ties in binary classification.
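
The effect of k can be seen on a small synthetic dataset. The sketch below uses scikit-learn's `make_moons` rather than the credit data, and the sample size, noise level, and the three k values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, noisy two-class data (not the credit dataset).
X_demo, y_demo = make_moons(n_samples=400, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

results = {}
for k in (1, 15, len(X_tr)):  # very small, moderate, and maximal k
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # Store (training accuracy, test accuracy) for this k.
    results[k] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f"k={k:>3}  train={results[k][0]:.3f}  test={results[k][1]:.3f}")
```

With k=1 every training point is its own nearest neighbour, so the training labels are reproduced exactly; with k equal to the training-set size every prediction is simply the majority class.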


In [1]:
import pandas as pd

# Load the credit dataset and drop rows with missing values.
df = pd.read_csv('data/credit_data.csv')
df.dropna(inplace=True)
df.head()


| Unnamed: 0 | clientid | income | age | loan | default |
|---|---|---|---|---|---|
| 0 | 1 | 66155.9251 | 59.017015 | 8106.532131 | 0 |
| 1 | 2 | 34415.15397 | 48.117153 | 6564.745018 | 0 |
| 2 | 3 | 57317.17006 | 63.108049 | 8020.953296 | 0 |
| 3 | 4 | 42709.5342 | 45.751972 | 6103.64226 | 0 |
| 4 | 5 | 66952.68885 | 18.584336 | 8770.099235 | 1 |


In [2]:
# Features used in the distance computation, and the target label.
X = df[['income', 'age', 'loan']]
y = df['default']

### Transformations

Why transformations? The distance formula depends on how features are measured. If some features have much larger values than others (here, income is in the tens of thousands while age is in the tens), the distance will be dominated by those features. So we rescale the features so that each one contributes roughly equally to the distance.

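- Min-max normalization: rescales all the values into the range 0 to 1.

`x' = (x - xmin) / (xmax - xmin)`

- Z-score normalization (standardization): rescales the feature to have mean 0 and standard deviation 1. It does not turn the feature into a normal distribution; the shape of the distribution is unchanged.

`x' = (x - mean(x)) / std(x)`

For PCA, z-score normalization is the usual choice.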
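
Both rescalings can be written directly in NumPy. The income values below are made up for illustration, and `np.std` is used with its default population formula (`ddof=0`).

```python
import numpy as np

income = np.array([20000.0, 35000.0, 50000.0, 80000.0])  # hypothetical values

# Min-max normalization: the smallest value maps to 0, the largest to 1.
income_minmax = (income - income.min()) / (income.max() - income.min())

# Z-score normalization: the result has mean 0 and standard deviation 1.
income_z = (income - income.mean()) / income.std()

print(income_minmax)
print(income_z)
```

Neither transformation changes the ordering of the values or the shape of their distribution; only the location and scale change.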

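In [3]:
from sklearn.preprocessing import MinMaxScaler

# Rescale every feature to [0, 1].
# Caveat: fitting on all of X before the train/test split lets test-set
# min/max values leak into training; fitting on the training rows only is safer.
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
X_scaled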

array([[0.9231759 , 0.95743135, 0.58883739],
       [0.28812165, 0.86378597, 0.47682695],
       [0.74633429, 0.99257918, 0.58262011],
       ...,
       [0.48612202, 0.69109837, 0.40112895],
       [0.47500998, 1.        , 0.1177903 ],
       [0.98881367, 0.93282208, 0.53597028]], shape=(1997, 3))

In [4]:
from sklearn.model_selection import train_test_split

# Hold out 30% of the rows for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)
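
An alternative ordering is to split first and fit the scaler on the training rows only, so that no test-set statistics influence the preprocessing. The sketch below shows this on synthetic stand-in data; the value ranges, random labels, and all `*_demo` names are made up for illustration, and it does not reproduce the results in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for three features on very different scales (hypothetical ranges).
rng = np.random.default_rng(0)
X_raw_demo = rng.uniform(low=[20000, 18, 0], high=[70000, 65, 14000], size=(200, 3))
y_raw_demo = rng.integers(0, 2, size=200)

X_train_raw, X_test_raw, y_train_demo, y_test_demo = train_test_split(
    X_raw_demo, y_raw_demo, test_size=0.3, random_state=42
)

demo_scaler = MinMaxScaler()
X_train_demo = demo_scaler.fit_transform(X_train_raw)  # min/max learned from training rows only
X_test_demo = demo_scaler.transform(X_test_raw)        # the same min/max reused on the test rows

print(X_train_demo.min(axis=0), X_train_demo.max(axis=0))
print(X_test_demo.min(axis=0), X_test_demo.max(axis=0))
```

With this ordering the scaled test values can fall slightly outside [0, 1], because a test row may exceed the minimum or maximum seen in training.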

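In [5]:
from sklearn.neighbors import KNeighborsClassifier

# fit() stores the training data; predict() takes a majority vote among the 20 nearest neighbours.
knn = KNeighborsClassifier(n_neighbors=20)
knn.fit(X_train, y_train)
predicted = knn.predict(X_test)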

In [6]:
predicted

array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [7]:
from sklearn.metrics import accuracy_score, confusion_matrix

print("Accuracy:", accuracy_score(y_test, predicted))
# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y_test, predicted))


Accuracy: 0.975
[[508   1]
 [ 14  77]]
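
Accuracy alone hides how the errors are distributed between the two classes. As a sketch, precision and recall for the default class (label 1) can be computed from the confusion matrix printed above, using scikit-learn's layout of rows for true classes and columns for predicted classes.

```python
import numpy as np

# Confusion matrix copied from the output above: [[TN, FP], [FN, TP]].
cm = np.array([[508, 1],
               [14, 77]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)  # of the predicted defaults, the fraction that were real
recall = tp / (tp + fn)     # of the real defaults, the fraction that were caught
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

Most of the errors here are false negatives: 14 of the 91 actual defaults were predicted as non-defaults.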


### Choosing k with cross validation

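In [9]:
from sklearn.model_selection import cross_val_score

# Mean 10-fold cross-validation accuracy for each k from 1 to 99.
# Caveat: X_scaled includes the test rows, so the later test score is somewhat optimistic.
scores = []
for k in range(1, 100):
    knn = KNeighborsClassifier(n_neighbors=k)
    cross_val_scores = cross_val_score(knn, X_scaled, y, cv=10)
    scores.append(cross_val_scores.mean())
# scores[0] corresponds to k=1, hence the +1.
print("Best k:", scores.index(max(scores)) + 1)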


Best k: 23
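
A variant of this search keeps the test split out of the selection entirely by combining the scaler and classifier in a `Pipeline` and tuning k with `GridSearchCV` on the training rows only. The sketch below runs on synthetic data from `make_classification`; the grid of odd k values up to 29, the 5 folds, and all variable names are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data; the credit dataset is not used here.
X_syn, y_syn = make_classification(
    n_samples=500, n_features=3, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42, stratify=y_syn
)

pipe = Pipeline([("scale", MinMaxScaler()), ("knn", KNeighborsClassifier())])
param_grid = {"knn__n_neighbors": list(range(1, 31, 2))}  # odd k from 1 to 29
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_tr, y_tr)  # the scaler is refit inside every cross-validation fold

best_k = search.best_params_["knn__n_neighbors"]
test_acc = search.score(X_te, y_te)  # held-out rows played no part in choosing k
print("Best k:", best_k, "test accuracy:", round(test_acc, 3))
```

Because the test rows are only touched in the final `score` call, the reported test accuracy is an estimate on data the search never saw.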


In [10]:
# Refit with the k selected by cross-validation and re-evaluate on the test split.
knn = KNeighborsClassifier(n_neighbors=23)
knn.fit(X_train, y_train)
predicted = knn.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predicted))
print(confusion_matrix(y_test, predicted))

Accuracy: 0.9783333333333334
[[508   1]
 [ 12  79]]
