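### K-Nearest Neighbors (KNN)
- Given a new data point, KNN looks at the data around it (its neighbors) and assigns the point to the class that contains the most of those neighbors.
- It is one of the simplest machine learning algorithms: intuitive, and often reasonably accurate.
- The result can change depending on the choice of K. K is the number of nearest neighbors consulted for a given data point; in scikit-learn's `KNeighborsClassifier` the default (`n_neighbors`) is 5.
- With K = 5, the point is classified from its 5 nearest neighbors. An odd K is generally used, because with an even K a two-class vote can end in a tie and no single result can be determined.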
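
The effect of K on the vote can be illustrated with a minimal sketch. The label list and the `vote` helper below are hypothetical and only stand in for neighbors that have already been sorted by distance; they are not part of scikit-learn.

```python
from collections import Counter

# Hypothetical neighbor labels, already sorted from nearest to farthest.
neighbor_labels = ["A", "B", "B", "A", "A", "B"]

def vote(labels, k):
    """Return (label, count) pairs for the k nearest labels, most common first."""
    return Counter(labels[:k]).most_common()

for k in (3, 4, 5):
    print(k, vote(neighbor_labels, k))
# 3 [('B', 2), ('A', 1)]
# 4 [('A', 2), ('B', 2)]
# 5 [('A', 3), ('B', 2)]
```

K = 3 predicts `B`, K = 5 predicts `A`, and K = 4 ends in a 2-2 tie, which is why an odd K is the usual choice for two classes.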

<img src="./images/knn01.png" width="400px"> <img src="./images/knn02.png" width="400px">

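- When KNN is trained with `fit`, it does not actually learn anything; it only stores the training data. A model like this is called a lazy model (lazy learner).
- Only when new data arrives does it look at the neighboring data and classify the point. Prediction is therefore done on the fly, with no model built in advance; the distance computations happen at prediction time.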
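
The lazy behavior can be sketched with a toy class. `LazyKNN`, its method names, and the sample points are assumptions made for illustration; this is not how scikit-learn implements `KNeighborsClassifier` internally.

```python
import math
from collections import Counter

class LazyKNN:
    """Toy lazy learner: fit() only stores data, predict_one() does the work."""

    def __init__(self, k=5):
        self.k = k

    def fit(self, X, y):
        # Nothing is estimated here; the training set is simply kept.
        self.X_ = [tuple(row) for row in X]
        self.y_ = list(y)
        return self

    def predict_one(self, x):
        # Distances are computed only now, when a query point arrives.
        order = sorted(range(len(self.X_)), key=lambda i: math.dist(self.X_[i], x))
        nearest_labels = [self.y_[i] for i in order[: self.k]]
        return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical two-dimensional training points with two classes.
X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y = ["negative", "negative", "negative", "positive", "positive", "positive"]

model = LazyKNN(k=3).fit(X, y)
print(model.predict_one((0.5, 0.5)))  # negative
print(model.predict_one((5.5, 5.5)))  # positive
```

Because `fit` does no work, training is essentially free, while every prediction has to scan the stored training set.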

<img src="./images/knn03.jpg" width="350px" style="margin-left: 20px;">

- To assign a point to the nearest class, the distance between data points has to be computed; two common choices are the Euclidean distance and the Manhattan distance.
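
The two distances can be compared with a small sketch. The function names and the sample points are chosen here for illustration only.

```python
def euclidean_distance(p, q):
    """Straight-line distance: square root of the summed squared differences."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def manhattan_distance(p, q):
    """Grid distance: sum of the absolute differences along each axis."""
    return sum(abs(a - b) for a, b in zip(p, q))

p, q = (1, 2), (4, 6)
print(euclidean_distance(p, q))  # 5.0
print(manhattan_distance(p, q))  # 7
```

In scikit-learn's `KNeighborsClassifier`, these correspond to the default Minkowski metric with `p=2` (Euclidean) and `p=1` (Manhattan).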

In [1]:
import pandas as pd

c_df = pd.read_csv('./datasets/corona.csv', low_memory=False)
c_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278848 entries, 0 to 278847
Data columns (total 11 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   Ind_ID               278848 non-null  int64 
 1   Test_date            278848 non-null  object
 2   Cough_symptoms       278596 non-null  object
 3   Fever                278596 non-null  object
 4   Sore_throat          278847 non-null  object
 5   Shortness_of_breath  278847 non-null  object
 6   Headache             278847 non-null  object
 7   Corona               278848 non-null  object
 8   Age_60_above         151528 non-null  object
 9   Sex                  259285 non-null  object
 10  Known_contact        278848 non-null  object
dtypes: int64(1), object(10)
memory usage: 23.4+ MB


In [2]:
c_df.isna().sum()

Ind_ID                      0
Test_date                   0
Cough_symptoms            252
Fever                     252
Sore_throat                 1
Shortness_of_breath         1
Headache                    1
Corona                      0
Age_60_above           127320
Sex                     19563
Known_contact               0
dtype: int64

In [12]:
c_df.duplicated().sum()

0

In [None]:
pre_c_df = c_df.copy()

In [4]:
columns = ['Test_date', 'Age_60_above', 'Sex', 'Ind_ID', 'Known_contact']

pre_c_df = c_df.drop(labels=columns, axis=1)

In [5]:
pre_c_df

| | Cough_symptoms | Fever | Sore_throat | Shortness_of_breath | Headache | Corona |
|---|---|---|---|---|---|---|
| 0 | True | False | True | False | False | negative |
| 1 | False | True | False | False | False | positive |
| 2 | False | True | False | False | False | positive |
| 3 | True | False | False | False | False | negative |
| 4 | True | False | False | False | False | negative |
| ... | ... | ... | ... | ... | ... | ... |
| 278843 | False | False | False | False | False | positive |
| 278844 | False | False | False | False | False | negative |
| 278845 | False | False | False | False | False | negative |
| 278846 | False | False | False | False | False | negative |


In [8]:
pre_c_df.isna().sum()

Cough_symptoms         252
Fever                  252
Sore_throat              1
Shortness_of_breath      1
Headache                 1
Corona                   0
dtype: int64

In [9]:
# Drop every row that has a missing value in any of the remaining columns.
pre_c_df = pre_c_df.dropna()

In [10]:
pre_c_df.isna().sum()

Cough_symptoms         0
Fever                  0
Sore_throat            0
Shortness_of_breath    0
Headache               0
Corona                 0
dtype: int64

In [11]:
pre_c_df.reset_index(drop=True, inplace=True)
pre_c_df

| | Cough_symptoms | Fever | Sore_throat | Shortness_of_breath | Headache | Corona |
|---|---|---|---|---|---|---|
| 0 | True | False | True | False | False | negative |
| 1 | False | True | False | False | False | positive |
| 2 | False | True | False | False | False | positive |
| 3 | True | False | False | False | False | negative |
| 4 | True | False | False | False | False | negative |
| ... | ... | ... | ... | ... | ... | ... |
| 278589 | False | False | False | False | False | positive |
| 278590 | False | False | False | False | False | negative |
| 278591 | False | False | False | False | False | negative |
| 278592 | False | False | False | False | False | negative |


In [13]:
pre_c_df.duplicated().sum()

278502

In [17]:
pre_c_df.drop_duplicates(inplace=True)

In [18]:
pre_c_df.reset_index(drop=True, inplace=True)

In [19]:
pre_c_df

| | Cough_symptoms | Fever | Sore_throat | Shortness_of_breath | Headache | Corona |
|---|---|---|---|---|---|---|
| 0 | True | False | True | False | False | negative |
| 1 | False | True | False | False | False | positive |
| 2 | True | False | False | False | False | negative |
| 3 | True | False | False | False | False | other |
| 4 | False | False | False | False | False | negative |
| ... | ... | ... | ... | ... | ... | ... |
| 87 | False | True | True | False | True | negative |
| 88 | True | True | True | True | True | other |
| 89 | True | False | True | True | True | other |
| 90 | True | True | False | True | True | other |


In [20]:
pre_c_df.Corona.value_counts()

Corona
negative    32
positive    32
other       28
Name: count, dtype: int64

In [None]:
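# Under-sample 'negative' and 'positive' to 28 rows each (the size of 'other') and rebuild the balanced frame.
neg = pre_c_df[pre_c_df.Corona == 'negative'].sample(28, random_state=124)
pos = pre_c_df[pre_c_df.Corona == 'positive'].sample(28, random_state=124)
pre_c_df = pd.concat([neg, pos, pre_c_df[pre_c_df.Corona == 'other']]).reset_index(drop=True)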


In [None]:
import numpy as np

# Cast the boolean symptom columns to 0/1 integers and assign the result back.
symptom_columns = ['Cough_symptoms', 'Fever', 'Sore_throat', 'Shortness_of_breath', 'Headache']
pre_c_df[symptom_columns] = pre_c_df[symptom_columns].astype(np.int64)

In [None]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
targets = encoder.fit_transform(pre_c_df.Corona)
pre_c_df['Target'] = targets

In [None]:
# The label-encoded Target column replaces the original string column.
pre_c_df.drop(labels=['Corona'], axis=1, inplace=True)

In [None]:
encoder.classes_

In [None]:
pre_c_df.Target.value_counts()

In [None]:
pre_c_df.hist(figsize=(15, 15))

In [None]:
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

features, targets = pre_c_df.iloc[:, :-1], pre_c_df.iloc[:, -1]

X_train, X_test, y_train, y_test = \
train_test_split(features, targets, stratify=targets, test_size=0.2, random_state=124)

knn_c = KNeighborsClassifier()

parameters = {
    'n_neighbors': [3, 5, 7]
}

g_knn_c = GridSearchCV(knn_c, param_grid=parameters, cv=5, refit=True, return_train_score=True)
g_knn_c.fit(X_train, y_train)

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, ConfusionMatrixDisplay

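def get_evaluation(y_test, prediction, classifier=None, X_test=None):
    confusion = confusion_matrix(y_test, prediction)
    accuracy = accuracy_score(y_test, prediction)
    # The target has three classes (negative, other, positive), so macro averaging is assumed here.
    precision = precision_score(y_test, prediction, average='macro')
    recall = recall_score(y_test, prediction, average='macro')
    f1 = f1_score(y_test, prediction, average='macro')
    # Multiclass ROC-AUC needs class probabilities (one-vs-rest assumed); NaN if no classifier is given.
    auc = roc_auc_score(y_test, classifier.predict_proba(X_test), multi_class='ovr') \
        if classifier is not None and X_test is not None else float('nan')

    print('Confusion matrix')
    print(confusion)
    print('Accuracy: {0:.4f}, Precision: {1:.4f}, Recall: {2:.4f}, F1: {3:.4f}, ROC-AUC: {4:.4f}'.format(accuracy, precision, recall, f1, auc))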
    print("#" * 80)
    
    if classifier is not None and  X_test is not None:
        fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12,4))
        titles_options = [("Confusion matrix", None), ("Normalized confusion matrix", "true")]

        for (title, normalize), ax in zip(titles_options, axes.flatten()):
            disp = ConfusionMatrixDisplay.from_estimator(classifier, X_test, y_test, ax=ax, cmap=plt.cm.Blues, normalize=normalize)
            disp.ax_.set_title(title)
        plt.show()

In [None]:
knn_c = g_knn_c.best_estimator_
prediction = knn_c.predict(X_test)

In [None]:
get_evaluation(y_test, prediction, knn_c, X_test)