# k-Nearest Neighbor Classifier

KNN is a non-parametric, lazy learning algorithm. Non-parametric means it makes no assumption about the underlying data distribution; in other words, the model structure is determined from the dataset itself. This is helpful in practice, because most real-world datasets do not follow theoretical mathematical assumptions. Lazy means there is no explicit model-building step: the algorithm does not generalize from the training points ahead of time, and instead keeps all of the training data and uses it at prediction time. This makes training fast and the testing phase slower and more costly, in both time and memory. In the worst case, KNN scans every stored training point for each prediction, and all of the training data must be held in memory.
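
To make the "lazy" behaviour concrete, here is a minimal from-scratch sketch. It is not how scikit-learn implements KNN; the class name `SimpleKNN`, the toy arrays, and the choice of Euclidean distance with a simple majority vote are all assumptions made for illustration. Note that `fit` only stores the data, while `predict` does the work of scanning every stored point.

```python
import numpy as np
from collections import Counter


class SimpleKNN:
    """Illustrative k-nearest-neighbor classifier (hypothetical helper, not scikit-learn)."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # "Training" only stores the data; no model parameters are estimated.
        self.X_train = np.asarray(X, dtype=float)
        self.y_train = np.asarray(y)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        predictions = []
        for x in X:
            # Euclidean distance from the query point to every stored training point.
            distances = np.sqrt(((self.X_train - x) ** 2).sum(axis=1))
            # Indices of the k closest training points.
            nearest = np.argsort(distances)[: self.k]
            # Majority vote among the labels of those neighbors.
            votes = Counter(self.y_train[nearest].tolist())
            predictions.append(votes.most_common(1)[0][0])
        return np.array(predictions)


# Toy data (made up for this sketch): two well-separated groups in 2D.
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_knn = SimpleKNN(k=3).fit(X_toy, y_toy)
toy_predictions = toy_knn.predict(np.array([[0.05, 0.05], [5.0, 5.1]]))
print(toy_predictions)
```

Because every prediction loops over all stored rows, the cost of `predict` grows with the size of the training set, which is the time and memory trade-off described above.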

In [1]:
import pickle as pkl

with open('../data/titanic_tansformed.pkl', 'rb') as f:
    df_data = pkl.load(f)

In [2]:
df_data.head()

Unnamed: 0,Survived,Age,SibSp,Parch,Fare,2,3,male,Q,S
0,0,22.0,1,0,7.25,0,1,1,0,1
1,1,38.0,1,0,71.2833,0,0,0,0,0
2,1,26.0,0,0,7.925,0,1,0,0,1
3,1,35.0,1,0,53.1,0,0,0,0,1
4,0,35.0,0,0,8.05,0,1,1,0,1


In [3]:
df_data.shape

(889, 10)

In [4]:
data = df_data.drop("Survived",axis=1)
label = df_data["Survived"]

In [5]:
from sklearn.model_selection import train_test_split  
data_train, data_test, label_train, label_test = train_test_split(data, label, test_size = 0.2, random_state = 101)

In [6]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix
import time

# Fit with default settings (n_neighbors=5) and time the fit
tic = time.time()
knn_cla = KNeighborsClassifier()
knn_cla.fit(data_train, label_train)
print('Time taken for training KNN', (time.time()-tic), 'secs')

# Evaluate on the held-out test set
predictions = knn_cla.predict(data_test)
print('Accuracy', knn_cla.score(data_test, label_test))

print(confusion_matrix(label_test, predictions))
print(classification_report(label_test, predictions))

Time taken for training KNN 0.0039789676666259766 secs
Accuracy 0.7078651685393258
[[81 26]
 [26 45]]
             precision    recall  f1-score   support

          0       0.76      0.76      0.76       107
          1       0.63      0.63      0.63        71

avg / total       0.71      0.71      0.71       178



### Hyperparameters for kNN
- The main hyperparameter is the number of neighbors, k (`n_neighbors` in scikit-learn), whose labels are consulted when classifying a point
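
As a rough illustration of why this value matters, the sketch below fits `KNeighborsClassifier` with a few values of `n_neighbors` on a synthetic dataset and records train and test accuracy. The dataset from `make_classification`, the variable names, and the particular values of k are assumptions for the example; they are not the Titanic data used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data (assumption: stands in for a real dataset).
X_syn, y_syn = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0
)

k_scores = {}
for k in [1, 5, 15, 51]:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # Store (train accuracy, test accuracy) for this k.
    k_scores[k] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f"k={k:>2}  train={k_scores[k][0]:.3f}  test={k_scores[k][1]:.3f}")
```

With k=1 each training point is its own nearest neighbor, so training accuracy is perfect and says little about generalization; larger k averages over more neighbors and smooths the decision boundary. This is why k is usually chosen with held-out data or cross-validation.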

In [8]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# Candidate values of k to try
n_neighbors = [2, 3, 4, 5, 6, 7, 8, 9]
score_func = 'accuracy'

# 5-fold cross-validated grid search over n_neighbors
# (for classifiers, an integer cv uses stratified folds)
knn_cla = KNeighborsClassifier()
knn_grid = GridSearchCV(estimator=knn_cla,
                        param_grid=[{'n_neighbors': n_neighbors}],
                        cv=5,
                        scoring=score_func)
knn_grid.fit(data_train, label_train)
print('Best Score', knn_grid.best_score_)
print('Best value for k', knn_grid.best_estimator_.n_neighbors)

Best Score 0.7187060478199718
Best value for k 8
