# KNN (Assignment 3)

## K-Nearest Neighbor

The KNN algorithm is a classification algorithm that determines the class of a new data point from its K nearest data points (its neighbors). It classifies data by similarity, that is, by how close a point lies to the other data points.
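The idea can be sketched in a few lines of plain Python. This is a minimal illustration, not the scikit-learn implementation used later in this notebook: the function name `knn_predict` and the toy data are made up for the example, and distances are measured with the standard library's `math.dist` (Python 3.8+).

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Distance from the query to every training point (Euclidean).
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbors.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # The most common label among those neighbors is the prediction.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical, for illustration only): two features per sample.
train_points = [(1.0, 1.1), (1.2, 0.9), (0.8, 1.0), (5.0, 5.2), (5.1, 4.8), (4.9, 5.0)]
train_labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(train_points, train_labels, (1.1, 1.0), k=3))  # A
print(knn_predict(train_points, train_labels, (4.8, 5.1), k=3))  # B
```

Sorting all distances is the simplest approach and is fine for small data; library implementations typically use faster neighbor searches.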



### Computing distance with Euclidean Distance


To compute the distance between two points, the KNN algorithm uses the Euclidean distance, which can be applied in a 1-dimensional, 2-dimensional, or multi-dimensional space.

A 1-dimensional space means the distance is computed from a single independent variable, a 2-dimensional space means there are two independent variables, and a multi-dimensional space means there are more than two.

### In general, the Euclidean distance formula is as follows.


$$
dis\left(x_{1},x_{2}\right)=\sqrt{\sum_{i=1}^{n}\left(x_{1i}-x_{2i}\right)^{2}}
$$

Here $n$ is the number of independent variables and $x_{1i}$ is the value of the $i$-th variable for point $x_{1}$. With a single independent variable ($n=1$) the sum has only one term, so the distance reduces to $\left|x_{1}-x_{2}\right|$. What happens when several variables are used?

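### With more than one variable, the squared differences are added together as shown below.


$$
dis=\sqrt{\left(x_{1}-x_{2}\right)^{2}+\left(y_{1}-y_{2}\right)^{2}+\dots}
$$

Here $x$, $y$, and so on each stand for one independent variable, and the subscripts 1 and 2 mark the two points being compared. This is the same sum of squared differences, written out with one letter per variable.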
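As a small worked example, the formula can be written as a plain Python function. The name `euclidean_distance` is chosen here for illustration; the four-variable call uses the measurements of the first and third iris rows displayed further down in this notebook.

```python
import math

def euclidean_distance(p, q):
    """Square root of the summed squared differences, one term per variable."""
    if len(p) != len(q):
        raise ValueError("both points need the same number of variables")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# One variable: the distance is just the absolute difference.
print(euclidean_distance([5.1], [4.9]))
# Two variables: the hypotenuse of a 3-4-5 right triangle.
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
# Four variables: sepal/petal measurements of two iris rows.
print(euclidean_distance([5.1, 3.5, 1.4, 0.2], [4.7, 3.2, 1.3, 0.2]))
```

Because every variable contributes its raw squared difference, variables measured on larger scales dominate the distance; that is worth keeping in mind when features have different units.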

## Reading in the training data

In [None]:
import pandas as pd
#read in the data using pandas
df = pd.read_csv("https://gist.githubusercontent.com/netj/8836201/raw/6f9306ad21398ea43cba4f7d537619d0e07d5ae3/iris.csv")

#check data has been read in properly
df.head()

|   | sepal.length | sepal.width | petal.length | petal.width | variety |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Setosa |


In [None]:
df.shape

(150, 5)

## Split up the dataset into inputs and targets

In [None]:
#create a dataframe with all training data except the target column
X = df.drop(columns=["variety"])
#check that the target variable has been removed
X.head()

|   | sepal.length | sepal.width | petal.length | petal.width |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [None]:
#separate target values
y = df["variety"].values
#view target values
y[0:5]

array(['Setosa', 'Setosa', 'Setosa', 'Setosa', 'Setosa'], dtype=object)

## Split the dataset into train and test data

In [None]:
from sklearn.model_selection import train_test_split
#split dataset into train (80%) and test (20%) data; stratify=y keeps the class proportions equal in both sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)

## Building and training the model

In [None]:
from sklearn.neighbors import KNeighborsClassifier
#create KNN classifier
knn = KNeighborsClassifier(n_neighbors=11)
#fit the classifier to the training data
knn.fit(X_train, y_train)

KNeighborsClassifier(n_neighbors=11)

## Testing the model

In [None]:
#show first 5 model predictions on the test data
knn.predict(X_test)[0:5]

array(['Virginica', 'Setosa', 'Versicolor', 'Setosa', 'Setosa'],
      dtype=object)

In [None]:
#check accuracy of our model on the test data
knn.score(X_test, y_test)

0.9666666666666667

## k-Fold Cross-Validation

In [None]:
from sklearn.model_selection import cross_val_score
import numpy as np
#create a new KNN model
knn_cv = KNeighborsClassifier(n_neighbors=3)
#evaluate the model with 5-fold cross-validation
cv_scores = cross_val_score(knn_cv, X, y, cv=5)
#print each cv score (accuracy) and average them
print(cv_scores)
print('cv_scores mean:{}'.format(np.mean(cv_scores)))


[0.96666667 0.96666667 0.93333333 0.96666667 1.        ]
cv_scores mean:0.9666666666666668
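To make the five scores above easier to read, here is a rough sketch of how k-fold splitting partitions the 150 rows. It is a simplification: the helper `k_fold_indices` is hypothetical and cuts the data into plain contiguous folds, whereas `cross_val_score` with an integer `cv` and a classifier uses stratified folds that preserve the class proportions in each fold.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous, unshuffled folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

folds = list(k_fold_indices(150, 5))
for fold_number, (train, test) in enumerate(folds, start=1):
    # Each fold trains on 120 rows and tests on the remaining 30.
    print(fold_number, len(train), len(test))

# With 30 test rows per fold, every fold accuracy is a multiple of 1/30.
print(29 / 30, 28 / 30)
```

Each row is used for testing exactly once. With 30 test rows per fold, a score of 0.96666667 corresponds to 29 of 30 correct predictions and 0.93333333 to 28 of 30.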


## Tuning hyperparameters using GridSearchCV

In [None]:
from sklearn.model_selection import GridSearchCV
#create a new knn model
knn2 = KNeighborsClassifier()
#create a dictionary of all values we want to test for n_neighbors
param_grid = {'n_neighbors': np.arange(1, 25)}
#use gridsearch to test all values for n_neighbors
knn_gscv = GridSearchCV(knn2, param_grid, cv=5)
#fit model to data
knn_gscv.fit(X, y)

GridSearchCV(cv=5, estimator=KNeighborsClassifier(),
             param_grid={'n_neighbors': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24])})

In [None]:
#check top performing n_neighbors value
knn_gscv.best_params_

{'n_neighbors': 6}

In [None]:
#check mean score for the top performing value of n_neighbors
knn_gscv.best_score_

0.9800000000000001
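Beyond the single best value, the fitted search object also keeps the score of every candidate. The sketch below shows one way to inspect them. It assumes scikit-learn's bundled `load_iris` data as a stand-in for the CSV read earlier, and the names `X_demo`, `y_demo`, and `search` are introduced only for this example, so the printed numbers may differ slightly from the ones above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the bundled iris data stands in for the CSV used in this notebook.
X_demo, y_demo = load_iris(return_X_y=True)

search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": np.arange(1, 25)}, cv=5)
search.fit(X_demo, y_demo)

# cv_results_ holds one entry per candidate value of n_neighbors.
results = search.cv_results_
order = np.argsort(results["rank_test_score"])
for i in order[:5]:
    # Print the five best candidates with their mean cross-validated accuracy.
    print(results["param_n_neighbors"][i], round(float(results["mean_test_score"][i]), 4))

# With the default refit=True, the best candidate is retrained on all the data.
print(search.best_estimator_)
```

Looking at the top few candidates rather than only the winner shows how close neighboring values of `n_neighbors` score, which helps judge how much the exact choice matters.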