# AI: K-NN

Build a classification system from scratch using the k-Nearest Neighbors (k-NN) method. The model is trained on `DataTrain_KNN.csv` and then used to predict the class label of each row in `DataTest_KNN.csv`.

## Import the Necessary Libraries

In [77]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from scipy import stats
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RepeatedStratifiedKFold

## Read the Data Train

In [11]:
data_train = pd.read_csv('DataTrain_KNN.csv')
print(data_train.shape)
data_train.head()

(800, 7)


Unnamed: 0,Index,X1,X2,X3,X4,X5,Y
0,1,-1.608052,-0.377992,1.204209,1.313808,1.218265,1
1,2,0.393766,0.630685,-1.222062,0.090558,0.015893,0
2,3,-0.466243,0.276972,2.519047,0.673745,0.16729,1
3,4,1.47121,-0.046791,-0.303291,-0.365437,1.989287,0
4,5,-1.672906,1.25588,-0.355706,0.123143,-2.241941,1


In [12]:
X = data_train[data_train.columns[1:-1]]
y = data_train.Y

## KNN Algorithm

Before applying the k-NN algorithm to the training data, we define a few helper functions.

### Euclidean Distance Function

$d(p,q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$

In [27]:
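def euclidean(a, b):
    # Convert to float arrays so lists, arrays, and pandas rows are all handled by position
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Square root of the summed squared differences
    return np.sqrt(np.sum((a - b) ** 2))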

Test the `euclidean` function:

In [30]:
list_a = [1,2,3]
list_b = [5,6,7]

euclidean(list_a,list_b)

6.928203230275509
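
As a side note, the same formula can be applied to many points at once with NumPy broadcasting. The sketch below is only an illustration: `euclidean_to_all` is a hypothetical helper that is not part of the notebook, and the sample points are made up.

```python
import numpy as np

def euclidean_to_all(point, points):
    """Distance from one point to every row of `points` (hypothetical helper)."""
    point = np.asarray(point, dtype=float)
    points = np.asarray(points, dtype=float)
    # Broadcasting subtracts `point` from each row, then squares are summed per row
    return np.sqrt(((points - point) ** 2).sum(axis=1))

# Made-up sample points: the first row repeats the test case above
distances = euclidean_to_all([1, 2, 3], [[5, 6, 7], [1, 2, 3], [1, 2, 5]])
print(distances)
```

The first distance matches the single-pair result above, the second is zero because the points coincide, and the third is 2 because only the last coordinate differs.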

### K-NN Function

For each test row, find the k nearest training rows and assign the most common label among them.

In [84]:
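def k_nearest_neighbors(k_value, X_train, y_train, X_test):
    # Plain NumPy arrays give positional indexing regardless of the pandas index
    train_rows = np.asarray(X_train, dtype=float)
    train_labels = np.asarray(y_train)
    y_predict = []
    for test in np.asarray(X_test, dtype=float):
        # Distance from this test row to every training row
        list_distance = [euclidean(test, train) for train in train_rows]

        # Positions of the k closest training rows
        top_k = np.argsort(list_distance)[:k_value]
        # Majority vote; np.ravel copes with SciPy returning either an array or a scalar
        y_predict.append(np.ravel(stats.mode(train_labels[top_k]).mode)[0])

    return y_predict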
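
To make the neighbor search and the majority vote concrete, here is a minimal self-contained sketch on a toy dataset. Everything in it is an assumption made for illustration: the toy points, the labels, and the helper name `predict_one` do not come from the notebook, and the vote uses `np.unique` rather than `stats.mode`.

```python
import numpy as np

# Toy training set: two well-separated clusters with labels 0 and 1 (made-up values)
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

def predict_one(query, X_train, y_train, k):
    # Distances from the query to every training row
    distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    # Positions of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote; on a tie the smallest label wins because np.unique sorts its output
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

pred_a = predict_one(np.array([0.3, 0.3]), X_toy, y_toy, k=3)
pred_b = predict_one(np.array([4.5, 4.5]), X_toy, y_toy, k=3)
print(pred_a, pred_b)
```

With k=3, the query near the origin has three label-0 neighbors and the query near (5, 5) has three label-1 neighbors, so the predictions are 0 and 1.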

## Hyperparameters Tuning

### Observe The Best k
Evaluate every k from 1 to 25 inclusive, scoring each with repeated stratified cross-validation (4 folds, repeated twice).

In [88]:
rskf = RepeatedStratifiedKFold(n_splits=4, n_repeats=2, random_state=32)

In [None]:
k_performance = []

# Search k = 1..25 inclusive
for k_value in range(1, 26):
    temp_performance = []
    for train_index, val_index in rskf.split(X, y):
        X_train, X_val = X.iloc[train_index], X.iloc[val_index]
        y_train, y_val = y.iloc[train_index], y.iloc[val_index]

        predict_of_y = k_nearest_neighbors(k_value, X_train, y_train, X_val)
        # accuracy_score expects the true labels first, then the predictions
        temp_performance.append(accuracy_score(y_val, predict_of_y))

    # Mean accuracy over all folds and repeats for this k
    k_performance.append(np.mean(temp_performance))
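
To illustrate what the splitter produces, the sketch below runs `RepeatedStratifiedKFold` with the same settings on a small made-up dataset. The arrays `X_demo` and `y_demo` are assumptions for illustration only. With 4 splits repeated twice there are 8 train/validation pairs, and stratification keeps the class balance in every validation fold.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Made-up balanced dataset: 40 rows, 20 per class
X_demo = np.zeros((40, 2))
y_demo = np.array([0] * 20 + [1] * 20)

rskf_demo = RepeatedStratifiedKFold(n_splits=4, n_repeats=2, random_state=32)

fold_sizes = []
class_counts = []
for train_index, val_index in rskf_demo.split(X_demo, y_demo):
    # Record (train size, validation size) for each split
    fold_sizes.append((len(train_index), len(val_index)))
    # Count how many rows of each class land in the validation fold
    class_counts.append(np.bincount(y_demo[val_index]).tolist())

print(len(fold_sizes), fold_sizes[0], class_counts[0])
```

Each split here trains on 30 rows and validates on 10, with 5 rows of each class in the validation fold.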

### Get the Best K

We select the k in the range 1-25 that gives the highest mean cross-validation accuracy.

In [None]:
best_index = int(np.argmax(k_performance))
# The search starts at k = 1, so list position i holds the score for k = i + 1
print("Best k:", best_index + 1)
print("Accuracy of best k:", k_performance[best_index] * 100, "%")
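
One way to avoid offset mistakes when mapping a list position back to a k value is to keep the searched values in their own list and index into it. The sketch below assumes made-up scores: `k_values_demo` and `demo_scores` are hypothetical and are not results from this notebook.

```python
import numpy as np

# Hypothetical k values and made-up mean accuracies, for illustration only
k_values_demo = [1, 2, 3, 4, 5]
demo_scores = [0.81, 0.84, 0.88, 0.86, 0.88]

# np.argmax returns the first position of the maximum, so ties favor the smaller k
best_position = int(np.argmax(demo_scores))
best_k_demo = k_values_demo[best_position]
best_score_demo = demo_scores[best_position]

print("Best k:", best_k_demo)
print("Accuracy of best k:", best_score_demo * 100, "%")
```

Here k=3 and k=5 tie at 0.88 and the smaller k is reported.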

### Graph of Accuracy

This graph shows the mean accuracy obtained for each value of k.

In [None]:
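# Build the x-axis from the number of scores so both sequences always have the same length
plt.plot(range(1, len(k_performance) + 1), k_performance)
plt.xlabel("k")
plt.ylabel("Mean accuracy")
plt.show()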

## Data Test Classification Process

### Read the Data Test

In [None]:
data_test = pd.read_csv('DataTest_KNN.csv')
data_test.head()

### Label Prediction for Data Test

Now that we have the best k, we predict the class of each test row using the k-NN algorithm with that k.

In [None]:
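k = 8
# Use the same feature columns as in training (assumes the test file shares these column names)
X_test = data_test[X.columns]
y_prediction = k_nearest_neighbors(k, X, y, X_test)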

### Write the Result

Save the class predictions.
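
A minimal sketch of how predictions could be written to disk with pandas follows. The output filename, the column names `Index` and `Y`, and the stand-in list `demo_predictions` are all assumptions for illustration. In the notebook the values would come from `y_prediction`.

```python
import os
import tempfile

import pandas as pd

# Stand-in values; in the notebook these would come from y_prediction
demo_predictions = [1, 0, 1, 1]

# Pair each prediction with a 1-based row number (assumed layout)
result = pd.DataFrame({
    "Index": range(1, len(demo_predictions) + 1),
    "Y": demo_predictions,
})

# Hypothetical output location; replace with the desired path
output_path = os.path.join(tempfile.gettempdir(), "knn_predictions_demo.csv")
result.to_csv(output_path, index=False)

# Read the file back to confirm what was written
check = pd.read_csv(output_path)
print(check)
```

Passing `index=False` keeps the DataFrame's own row index out of the file, so the CSV contains only the two named columns.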