# 11. KNN

## 1. Understanding KNN

A method that assigns a data point the class that is most common among its k nearest neighbors (voting).  
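
The voting rule can be sketched in a few lines of plain Python. This is a minimal illustration, not the sklearn implementation: the function name `knn_predict` and the toy points are hypothetical, and Euclidean distance is assumed.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [math.dist(p, query) for p in train_points]
    # Indices of the k closest training points
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
    # Majority vote over the labels of those neighbors
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two clusters in 2-D
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (3.0, 3.0), (3.2, 2.9)]
labels = ["A", "A", "A", "B", "B"]

print(knn_predict(points, labels, query=(1.1, 1.0), k=3))  # A
print(knn_predict(points, labels, query=(3.1, 3.0), k=3))  # B
```

For the second query, two of the three nearest points are labeled B and one is labeled A, so the vote goes to B.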

In k-nearest-neighbors classification the result changes with the value of K, so finding an appropriate K is very important.
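
One common way to compare candidate values of K is cross-validation. The sketch below is an assumption on my part rather than part of the original notes: it uses sklearn's bundled iris data (`load_iris`) so that it runs offline, and the list of candidate values is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Mean 5-fold cross-validated accuracy for each candidate k
# (the candidate list is chosen only for illustration)
candidate_ks = [1, 3, 5, 7, 9, 11]
mean_scores = {}
for k in candidate_ks:
    model = KNeighborsClassifier(n_neighbors=k)
    mean_scores[k] = cross_val_score(model, X, y, cv=5).mean()

# The k with the highest mean accuracy (ties resolve to the first such k)
best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("best k:", best_k)
```

The scores for neighboring values of k are often close, so the chosen k is best read as a reasonable default rather than a uniquely correct answer.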

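* Characteristics
    - It requires no estimation method, statistical model, or data transformation.
    - It needs only the training data, so it is called lazy learning or instance-based learning.  
    - However, as the number of input variables grows, the [curse of dimensionality](https://modern-manual.tistory.com/4) becomes a problem.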
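
One symptom of the curse of dimensionality is that distances concentrate: the nearest and the farthest points end up almost equally far away, which weakens the notion of a "nearest" neighbor. The following is a rough sketch on uniformly random data; the helper name `nearest_to_farthest_ratio` and the point counts are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_to_farthest_ratio(n_points, n_dims):
    """Ratio of the nearest to the farthest distance from one random query point."""
    data = rng.random((n_points, n_dims))   # uniform points in the unit hypercube
    query = rng.random(n_dims)
    dists = np.linalg.norm(data - query, axis=1)
    return dists.min() / dists.max()

# The ratio moves toward 1 as the number of dimensions grows
for d in (2, 10, 100, 1000):
    print(d, round(nearest_to_farthest_ratio(500, d), 3))
```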
    


## 2. KNN with the sklearn module

In [1]:
import seaborn as sns

In [2]:
df = sns.load_dataset('iris')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


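The k-nearest-neighbors classifier predicts by classifying the value of the output variable. Its drawback is that predictive performance can degrade when it is applied to new data.    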

### 2-1. Splitting the data into Train and Test sets

Use the train_test_split() function from the sklearn.model_selection module

> from sklearn.model_selection import train_test_split

> train object, test object = train_test_split(dataset, test_size=ratio, random_state=seed, stratify=output variable)

The Train and Test sets are drawn by stratified sampling on the output variable that is to be predicted
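
The effect of `stratify` can be checked by counting classes in each part. This sketch uses sklearn's bundled copy of the iris data (an assumption made so that it runs without a download); in that copy the label column is named `target` rather than `species`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Bundled copy of the iris data; the label column is named 'target'
iris = load_iris(as_frame=True).frame

train_part, test_part = train_test_split(
    iris, test_size=0.3, random_state=1, stratify=iris["target"]
)

# With stratify, each class keeps its share in both parts
print(train_part["target"].value_counts().sort_index())  # 35 per class
print(test_part["target"].value_counts().sort_index())   # 15 per class
```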

In [4]:
from sklearn.model_selection import train_test_split

In [5]:
train, test = train_test_split(df, test_size=0.3, random_state=1, stratify=df['species'])

In [6]:
train.head()

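|     | sepal_length | sepal_width | petal_length | petal_width | species |
| --- | --- | --- | --- | --- | --- |
| 33 | 5.5 | 4.2 | 1.4 | 0.2 | setosa |
| 20 | 5.4 | 3.4 | 1.7 | 0.2 | setosa |
| 115 | 6.4 | 3.2 | 5.3 | 2.3 | virginica |
| 124 | 6.7 | 3.3 | 5.7 | 2.1 | virginica |
| 35 | 5.0 | 3.2 | 1.2 | 0.2 | setosa |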


In [7]:
test.head()

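|     | sepal_length | sepal_width | petal_length | petal_width | species |
| --- | --- | --- | --- | --- | --- |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
| 106 | 4.9 | 2.5 | 4.5 | 1.7 | virginica |
| 75 | 6.6 | 3.0 | 4.4 | 1.4 | versicolor |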


In [8]:
y_train=train['species']
x_train=train[['sepal_length','sepal_width','petal_length','petal_width']]

y_test=test['species']
x_test=test[['sepal_length','sepal_width','petal_length','petal_width']]

In [9]:
y_train.head()

33        setosa
20        setosa
115    virginica
124    virginica
35        setosa
Name: species, dtype: object

In [10]:
x_train.head()

|     | sepal_length | sepal_width | petal_length | petal_width |
| --- | --- | --- | --- | --- |
| 33 | 5.5 | 4.2 | 1.4 | 0.2 |
| 20 | 5.4 | 3.4 | 1.7 | 0.2 |
| 115 | 6.4 | 3.2 | 5.3 | 2.3 |
| 124 | 6.7 | 3.3 | 5.7 | 2.1 |
| 35 | 5.0 | 3.2 | 1.2 | 0.2 |


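### 2-2. Running the k-nearest-neighbors classifier

Use the KNeighborsClassifier() class from the sklearn.neighbors module

> from sklearn.neighbors import KNeighborsClassifier

> KNeighborsClassifier(n_neighbors=k, p=Minkowski parameter).fit(input variables, output variable)

* n_neighbors option: the number of neighboring data points (k) used for voting.
* p option: the Minkowski parameter used to measure the distance between data points; p=1 is Manhattan distance, p=2 is Euclidean distance

> Predictions: knn object.predict(input variables)

Assign the result of KNeighborsClassifier().fit() to an object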
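
To make the p option concrete: the Minkowski distance of order p between points a and b is (sum over i of |a_i - b_i|^p)^(1/p). A small sketch follows; the helper `minkowski_distance` is hypothetical and written only for illustration, not taken from sklearn.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a, b = (0.0, 0.0), (3.0, 4.0)
print(minkowski_distance(a, b, p=1))  # 7.0 (Manhattan: 3 + 4)
print(minkowski_distance(a, b, p=2))  # 5.0 (Euclidean: sqrt(9 + 16))
```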

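To check the evaluation measures for the k-nearest-neighbors model (the confusion matrix and the accuracy), use the sklearn.metrics module.   

> from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

> Confusion matrix: confusion_matrix(predicted output, actual output); sklearn documents the order as confusion_matrix(y_true, y_pred), so with predictions passed first the rows are predicted classes and the columns are actual classes

> Accuracy: accuracy_score(actual output, predicted output)

> Classification metrics: classification_report(actual output, predicted output)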
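
The argument order matters for the confusion matrix but not for accuracy. A small sketch with made-up labels (the two lists are hypothetical) shows the difference:

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical labels, for illustration only
y_true = ["a", "a", "b", "b", "b"]
y_pred = ["a", "b", "b", "b", "b"]

# Documented order: rows = actual classes, columns = predicted classes
print(confusion_matrix(y_true, y_pred))
# [[1 1]
#  [0 3]]

# Predictions first: the same counts, transposed (rows = predicted classes)
print(confusion_matrix(y_pred, y_true))
# [[1 0]
#  [1 3]]

# Accuracy is symmetric in its two arguments
print(accuracy_score(y_true, y_pred))  # 0.8
```

When reading a confusion matrix next to a classification_report, it helps to confirm which axis holds the actual classes before interpreting precision and recall.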

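**Because the k-nearest-neighbors classifier judges how close data points are by measuring distance, the units of the data matter a great deal.** When the input variables are on very different scales the results suffer, so the input variables should be standardized where necessary.   

> from sklearn.preprocessing import StandardScaler

> object = StandardScaler().fit(Train input variables)

> Standardize the Train data: object.transform(Train input variables)

> Standardize the Test data: object.transform(Test input variables)

If standardization makes little difference to the results, use the unstandardized results, because the model is then easier to interpret.   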
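
The point of fitting the scaler on the Train data only is that the Test data is transformed with the Train mean and standard deviation. A minimal sketch on made-up numbers (the arrays and their units are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: column 0 in millimetres, column 1 in grams
X_train = np.array([[40.0, 3000.0], [45.0, 4000.0], [50.0, 5000.0]])
X_test = np.array([[42.0, 3500.0]])

scaler = StandardScaler().fit(X_train)     # learns mean and std from the training data only
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)      # reuses the training mean and std

# Equivalent manual computation (StandardScaler uses the population std, ddof=0)
manual = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)
print(np.allclose(X_test_std, manual))     # True

# After scaling, every training column has mean 0 and std 1,
# so no single unit dominates the distance
print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
```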



In [11]:
from sklearn.neighbors import KNeighborsClassifier

In [12]:
knn = KNeighborsClassifier(n_neighbors=5, p=2).fit(x_train, y_train) # p=2: Euclidean distance

In [13]:
y_train_pred = knn.predict(x_train)

In [14]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

In [15]:
confusion_matrix(y_train_pred, y_train)

array([[35,  0,  0],
       [ 0, 33,  0],
       [ 0,  2, 35]])

In [16]:
accuracy_score(y_train_pred, y_train)

0.9809523809523809

In [17]:
knn_report = classification_report(y_train, y_train_pred)

In [18]:
print(knn_report)

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        35
  versicolor       1.00      0.94      0.97        35
   virginica       0.95      1.00      0.97        35

    accuracy                           0.98       105
   macro avg       0.98      0.98      0.98       105
weighted avg       0.98      0.98      0.98       105



In [19]:
y_test_pred = knn.predict(x_test)

In [20]:
confusion_matrix(y_test_pred, y_test)

array([[15,  0,  0],
       [ 0, 15,  1],
       [ 0,  0, 14]])

In [21]:
accuracy_score(y_test_pred, y_test)

0.9777777777777777

In [22]:
knn_report1 = classification_report(y_test, y_test_pred)

In [23]:
print(knn_report1)

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        15
  versicolor       0.94      1.00      0.97        15
   virginica       1.00      0.93      0.97        15

    accuracy                           0.98        45
   macro avg       0.98      0.98      0.98        45
weighted avg       0.98      0.98      0.98        45



#### Standardizing the input variables

In [24]:
from sklearn.preprocessing import StandardScaler

In [25]:
# '.fit()' computes the mean and std to be used for later scaling
sc = StandardScaler().fit(x_train)

The mean and standard deviation computed for standardization are stored in the sc object

In [26]:
x_train_std = sc.transform(x_train)

In [27]:
x_test_std = sc.transform(x_test)

In [28]:
from sklearn.neighbors import KNeighborsClassifier

In [29]:
knn1=KNeighborsClassifier(n_neighbors=5, p=2).fit(x_train_std, y_train)

In [30]:
y_train_std_pred = knn1.predict(x_train_std)

In [31]:
confusion_matrix(y_train_std_pred, y_train)

array([[35,  0,  0],
       [ 0, 33,  2],
       [ 0,  2, 33]])

In [32]:
accuracy_score(y_train_std_pred, y_train)

0.9619047619047619

In [33]:
y_test_std_pred = knn1.predict(x_test_std)

In [34]:
confusion_matrix(y_test_std_pred, y_test)

array([[15,  0,  0],
       [ 0, 13,  1],
       [ 0,  2, 14]])

In [35]:
accuracy_score(y_test_std_pred, y_test)

0.9333333333333333

In [36]:
knn1_report1 = classification_report(y_test, y_test_std_pred)

In [37]:
print(knn1_report1)

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        15
  versicolor       0.93      0.87      0.90        15
   virginica       0.88      0.93      0.90        15

    accuracy                           0.93        45
   macro avg       0.93      0.93      0.93        45
weighted avg       0.93      0.93      0.93        45



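## Exercise

### [Assignment 11] Practice with the k-nearest-neighbors classifier

Using the penguins data from the seaborn module, run a k-nearest-neighbors classifier that predicts species from bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g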

In [38]:
df = sns.load_dataset('penguins')

In [39]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB


In [40]:
df = df.dropna(subset=['species', 'bill_length_mm','bill_depth_mm','flipper_length_mm','body_mass_g'], 
               how='any', axis=0)

In [41]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 342 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            342 non-null    object 
 1   island             342 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 21.4+ KB


In [42]:
train, test = train_test_split(df, test_size=0.3, random_state=1, stratify=df['species'])

In [43]:
y_train = train['species']
X_train = train[['bill_length_mm','bill_depth_mm','flipper_length_mm','body_mass_g']]

y_test = test['species']
X_test = test[['bill_length_mm','bill_depth_mm','flipper_length_mm','body_mass_g']]

In [44]:
X_train.head()

|     | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
| --- | --- | --- | --- | --- |
| 99 | 43.2 | 18.5 | 192.0 | 4100.0 |
| 188 | 47.6 | 18.3 | 195.0 | 3850.0 |
| 81 | 42.9 | 17.6 | 196.0 | 4700.0 |
| 133 | 37.5 | 18.5 | 199.0 | 4475.0 |
| 100 | 35.0 | 17.9 | 192.0 | 3725.0 |


In [45]:
y_train.head()

99        Adelie
188    Chinstrap
81        Adelie
133       Adelie
100       Adelie
Name: species, dtype: object

In [46]:
knn = KNeighborsClassifier(n_neighbors=3, p=2).fit(X_train, y_train)

In [47]:
y_train_pred = knn.predict(X_train)

In [48]:
confusion_matrix(y_train_pred, y_train)

array([[100,  15,   4],
       [  2,  29,   1],
       [  4,   3,  81]])

In [49]:
accuracy_score(y_train_pred, y_train)

0.8786610878661087

In [50]:
knn_report = classification_report(y_train, y_train_pred)

In [51]:
print(knn_report)

              precision    recall  f1-score   support

      Adelie       0.84      0.94      0.89       106
   Chinstrap       0.91      0.62      0.73        47
      Gentoo       0.92      0.94      0.93        86

    accuracy                           0.88       239
   macro avg       0.89      0.83      0.85       239
weighted avg       0.88      0.88      0.87       239



In [52]:
y_test_pred = knn.predict(X_test)

In [53]:
confusion_matrix(y_test_pred, y_test)

array([[42, 16,  3],
       [ 1,  4,  1],
       [ 2,  1, 33]])

In [54]:
accuracy_score(y_test_pred, y_test)

0.7669902912621359

In [55]:
knn_report1 = classification_report(y_test, y_test_pred)

In [56]:
print(knn_report1)

              precision    recall  f1-score   support

      Adelie       0.69      0.93      0.79        45
   Chinstrap       0.67      0.19      0.30        21
      Gentoo       0.92      0.89      0.90        37

    accuracy                           0.77       103
   macro avg       0.76      0.67      0.66       103
weighted avg       0.77      0.77      0.73       103



In [57]:
sc = StandardScaler().fit(X_train)

In [58]:
X_train_std = sc.transform(X_train)

In [59]:
X_test_std = sc.transform(X_test)

In [60]:
knn1=KNeighborsClassifier(n_neighbors=3, p=2).fit(X_train_std, y_train)

In [61]:
y_train_std_pred = knn1.predict(X_train_std)

In [62]:
confusion_matrix(y_train_std_pred, y_train)

array([[106,   1,   0],
       [  0,  46,   0],
       [  0,   0,  86]])

In [63]:
accuracy_score(y_train_std_pred, y_train)

0.99581589958159

In [64]:
knn1_report = classification_report(y_train, y_train_std_pred)

In [65]:
print(knn1_report)

              precision    recall  f1-score   support

      Adelie       0.99      1.00      1.00       106
   Chinstrap       1.00      0.98      0.99        47
      Gentoo       1.00      1.00      1.00        86

    accuracy                           1.00       239
   macro avg       1.00      0.99      0.99       239
weighted avg       1.00      1.00      1.00       239



The results are far more accurate than before standardization (Train accuracy of about 0.996 versus 0.879)

In [66]:
y_test_std_pred = knn1.predict(X_test_std)

In [67]:
confusion_matrix(y_test_std_pred, y_test)

array([[44,  2,  0],
       [ 1, 19,  0],
       [ 0,  0, 37]])

In [68]:
accuracy_score(y_test_std_pred, y_test)

0.970873786407767

In [69]:
knn1_report1 = classification_report(y_test, y_test_std_pred)

In [70]:
print(knn1_report1)

              precision    recall  f1-score   support

      Adelie       0.96      0.98      0.97        45
   Chinstrap       0.95      0.90      0.93        21
      Gentoo       1.00      1.00      1.00        37

    accuracy                           0.97       103
   macro avg       0.97      0.96      0.96       103
weighted avg       0.97      0.97      0.97       103



For this data, applying KNN after standardization is more effective (Test accuracy of about 0.971 versus 0.767).
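
The scale-then-classify sequence can also be bundled into a single estimator with sklearn's `make_pipeline`, so the scaler is always fitted on the Train data and reused on the Test data. The sketch below is an assumption-laden illustration rather than the assignment's solution: it swaps in sklearn's bundled wine data (`load_wine`), whose features are on very different scales, so that it runs without downloading the penguins data.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled wine data, used here only because its features differ widely in scale
X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# KNN on the raw features
raw_knn = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)

# Same classifier, with standardization applied inside the pipeline
scaled_knn = make_pipeline(
    StandardScaler(), KNeighborsClassifier(n_neighbors=3)
).fit(X_tr, y_tr)

print("raw   :", raw_knn.score(X_te, y_te))
print("scaled:", scaled_knn.score(X_te, y_te))
```

As the iris results earlier show, standardization does not help on every data set, so both variants are worth comparing.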