# [ K-Nearest Neighbors Algorithm (K-NN) ]
<img src="https://user-images.githubusercontent.com/119478998/209499317-b69cea19-91f9-447e-bdf6-b5691b06ed6c.png" width="600" height="300" />

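- The K-NN (K-Nearest Neighbors) algorithm is a type of supervised learning. To make a prediction for a new data point, it finds the closest data points in the training dataset, i.e. its 'nearest neighbors'.

- The distance between rows is defined in terms of the differences between their feature values, and regression or classification is performed using the target values of the k rows at the smallest distance.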


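## What is K?
- The number of neighbors (surrounding data points) considered
- An odd K is recommended, since it avoids tied votes in binary classification.
- Start from the default (in scikit-learn, `n_neighbors=5`) and tune from there to find a suitable value.
  - K too large: classification/prediction becomes insensitive to local structure, risking underfitting.
  - K too small: the model is overly sensitive to individual points (including noise), so mispredictions become more likely, risking overfitting.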
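
One way to explore this trade-off is to sweep over several K values and compare cross-validated accuracy. The sketch below is only an illustration: the candidate list `k_values` and the use of the Iris data with 5-fold cross-validation are assumptions, not part of the original notes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Candidate K values (hypothetical choice for illustration; odd values only).
k_values = [1, 3, 5, 7, 9, 15, 31]

mean_scores = {}
for k in k_values:
    clf = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validated accuracy for this K.
    scores = cross_val_score(clf, X, y, cv=5)
    mean_scores[k] = scores.mean()

for k, s in mean_scores.items():
    print(f"k={k:>2}: mean CV accuracy = {s:.3f}")

# K with the highest mean accuracy among the candidates tried.
best_k = max(mean_scores, key=mean_scores.get)
print("best k:", best_k)
```

The best K found this way depends on the dataset and on the candidates tried, so it should be read as a starting point rather than a fixed rule.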

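## Rescaling feature ranges
- Features are measured in different units, so their ranges must be rescaled before distances are computed.
- Apply scaling such as standardization so that features with large magnitudes do not dominate the distance.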
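
To make the effect concrete, here is a small sketch with made-up numbers (the height/income columns and the helper `euclidean` are hypothetical). It compares Euclidean distances before and after `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: column 0 is height (cm), column 1 is yearly income.
X = np.array([
    [170.0, 30_000_000.0],
    [171.0, 60_000_000.0],
    [190.0, 30_500_000.0],
])

def euclidean(a, b):
    """Euclidean distance between two 1-D arrays."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Raw distances from row 0: the income column dominates completely,
# so the 20 cm height gap to row 2 is practically invisible.
d_raw_01 = euclidean(X[0], X[1])
d_raw_02 = euclidean(X[0], X[2])
print("raw:   ", d_raw_01, d_raw_02)

# After standardization each column has mean 0 and standard deviation 1,
# so both features contribute on a comparable footing.
X_scaled = StandardScaler().fit_transform(X)
d_scaled_01 = euclidean(X_scaled[0], X_scaled[1])
d_scaled_02 = euclidean(X_scaled[0], X_scaled[2])
print("scaled:", d_scaled_01, d_scaled_02)
```

In the raw data the two distances differ by orders of magnitude purely because of units. After scaling they are of similar size, which reflects that each row differs from row 0 strongly in exactly one feature.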

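## Procedure
1. Compute the distance between the new data point and every point in the training set.
2. Take the K points with the smallest Euclidean distance as its neighbors.
3. Find the class that occurs most often among those K points.
4. Assign the new data point to that class.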
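
The four steps above can be sketched from scratch with NumPy. This is a minimal illustration, not how scikit-learn implements it; the function name `knn_predict` and the toy data are hypothetical.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_new, k=5):
    """Classify one point by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # Step 1: Euclidean distance from x_new to every training point.
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 2: indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Step 3: count the class labels among those neighbors.
    votes = Counter(y_train[i] for i in nearest)
    # Step 4: return the most common class.
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated clusters.
X_train = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
           [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
y_train = ["A", "A", "A", "B", "B", "B"]

pred_a = knn_predict(X_train, y_train, [1.1, 1.0], k=3)
pred_b = knn_predict(X_train, y_train, [5.0, 5.1], k=3)
print(pred_a, pred_b)
```

In this sketch a tied vote goes to whichever tied class was encountered first among the neighbors (the nearer one), which is one possible tie-breaking rule among several.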

- When computing the distance to neighbors, K-NN can use Euclidean distance, Manhattan distance, Minkowski distance, and others.
  - By default, distance is defined as Euclidean distance.
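
These three metrics are closely related: the Minkowski distance with exponent p reduces to Manhattan distance for p=1 and to Euclidean distance for p=2. The helper `minkowski` below is a hypothetical name used only for illustration. In scikit-learn the same choice is exposed through the `metric` and `p` parameters of `KNeighborsClassifier`.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def minkowski(a, b, p):
    """Minkowski distance of order p between two points."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float((np.abs(a - b) ** p).sum() ** (1.0 / p))

a = [0.0, 0.0]
b = [3.0, 4.0]

manhattan = minkowski(a, b, p=1)  # |3| + |4|
euclid = minkowski(a, b, p=2)     # sqrt(3^2 + 4^2)
print(manhattan, euclid)

# The same choice in scikit-learn: metric="minkowski" with p=2 is the default
# (Euclidean); p=1 switches to Manhattan distance.
clf_manhattan = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=1)
```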

---

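## K-Nearest Neighbors 'Classification'
- Classification: the prediction is the most frequent class among the k nearest rows.
- The assigned class can change depending on the value of k.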

<img src="https://user-images.githubusercontent.com/119478998/209499850-c6751908-a1eb-455c-86dc-d8578c6993b5.png" width="400" height="300"/>

- With k=3 the point may be classified as class B, while with k=6 it may be classified as class A.
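
The same effect can be reproduced with a small hand-made dataset. The coordinates below are hypothetical and chosen so that the two nearest points to the query are class B while the next four are class A; they are not the points in the figure.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical points, listed by increasing distance from the query at the origin.
X = np.array([
    [1.0, 0.0],    # B, distance 1.0
    [0.0, 1.1],    # B, distance 1.1
    [1.5, 0.0],    # A, distance 1.5
    [0.0, 2.0],    # A, distance 2.0
    [-2.1, 0.0],   # A, distance 2.1
    [0.0, -2.2],   # A, distance 2.2
    [5.0, 5.0],    # B, far away
    [-5.0, 5.0],   # B, far away
])
y = np.array(["B", "B", "A", "A", "A", "A", "B", "B"])
query = np.array([[0.0, 0.0]])

# k=3: neighbors are B, B, A, so the majority is B.
pred_k3 = KNeighborsClassifier(n_neighbors=3).fit(X, y).predict(query)[0]
# k=6: neighbors are B, B, A, A, A, A, so the majority is A.
pred_k6 = KNeighborsClassifier(n_neighbors=6).fit(X, y).predict(query)[0]
print(pred_k3, pred_k6)
```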

In [None]:
# KNeighborsClassifier( ): used as the classification model

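## K-Nearest Neighbors 'Regression'
- For regression, the prediction is the mean of the target values of the k nearest rows.
- Drawback: it cannot extrapolate to new data outside the range of the training data. Every prediction is an average of training targets, so it can never fall outside the range of target values seen during training.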
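
A small sketch of both points, using `KNeighborsRegressor` on made-up data that follows y = 2x (the data and variable names are assumptions for illustration). Inside the training range the prediction is the mean of the two nearest targets; far outside it, the prediction stays stuck at the mean of the two largest targets.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical training data following y = 2x for x in 0..4.
X_train = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y_train = np.array([0.0, 2.0, 4.0, 6.0, 8.0])

reg = KNeighborsRegressor(n_neighbors=2)
reg.fit(X_train, y_train)

# Inside the range: neighbors are x=1 and x=2, so the prediction is mean(2, 4).
inside = reg.predict([[1.4]])[0]

# Outside the range: neighbors are x=4 and x=3, so the prediction is mean(8, 6),
# even though y = 2x would give 20 at x=10.
outside = reg.predict([[10.0]])[0]

print(inside, outside)
```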

---

## Exercise : Iris data

In [10]:
import pandas as pd
from sklearn.datasets import load_iris

In [11]:
iris = load_iris()
print(iris.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :

In [13]:
iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df["target"] = iris.target
iris_df.head()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


In [29]:
X,y = load_iris(return_X_y=True)

# Split the data so that generalization performance can be evaluated
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [30]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

x_train_scale = scaler.fit_transform(x_train)
x_test_scale = scaler.transform(x_test)

In [31]:
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

model = KNeighborsClassifier()
model.fit(x_train, y_train)  # before scaling

print(f'Train Data Score: {model.score(x_train, y_train)}')
print(f'Test Data Score: {model.score(x_test, y_test)}')

Train Data Score: 0.9833333333333333
Test Data Score: 0.9666666666666667


In [32]:
model = KNeighborsClassifier()
model.fit(x_train_scale, y_train)  # after scaling

print(f'Train Data Score: {model.score(x_train_scale, y_train)}')
print(f'Test Data Score: {model.score(x_test_scale, y_test)}')

Train Data Score: 0.9583333333333334
Test Data Score: 0.9333333333333333
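
A single random split of 150 samples is noisy, so a difference of one or two test samples between the scaled and unscaled runs should not be over-interpreted. A hedged alternative is to compare both variants with cross-validation, wrapping the scaler and the classifier in a pipeline so that the scaler is fit only on each training fold. The pipeline setup below is an assumption added for illustration and is not part of the original exercise.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Unscaled K-NN versus StandardScaler + K-NN, both with default settings.
raw_model = KNeighborsClassifier()
scaled_model = make_pipeline(StandardScaler(), KNeighborsClassifier())

# 5-fold cross-validated accuracy; the pipeline refits the scaler per fold.
raw_scores = cross_val_score(raw_model, X, y, cv=5)
scaled_scores = cross_val_score(scaled_model, X, y, cv=5)

print(f"raw:    {raw_scores.mean():.3f} +/- {raw_scores.std():.3f}")
print(f"scaled: {scaled_scores.mean():.3f} +/- {scaled_scores.std():.3f}")
```

All four Iris features are measured in cm, so scaling is expected to matter less here than on datasets whose features use very different units.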


In [34]:
from sklearn.model_selection import cross_validate, GridSearchCV
import multiprocessing

cross_validate(estimator = KNeighborsClassifier(), 
               X=X, y=y, cv=5, 
               n_jobs=multiprocessing.cpu_count())

{'fit_time': array([0.00300312, 0.00151229, 0.00151229, 0.0019927 , 0.00355721]),
 'score_time': array([0.00398755, 0.00606513, 0.00606513, 0.00398755, 0.0049839 ]),
 'test_score': array([0.96666667, 1.        , 0.93333333, 0.96666667, 1.        ])}

In [35]:
param_grid = [{'n_neighbors':[3,5,7],
               'weights': ['uniform', 'distance'],
               'algorithm': ['ball_tree', 'kd_tree', 'brute']}]

gs = GridSearchCV(
    estimator = KNeighborsClassifier(),
    param_grid = param_grid,
    n_jobs=multiprocessing.cpu_count(),
    verbose = True 
)

gs.fit(X, y)

Fitting 5 folds for each of 18 candidates, totalling 90 fits


In [None]:
gs.best_estimator_

In [None]:
print(f'GridSearchCV best score: {gs.best_score_}')
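
Beyond the best estimator and score, the fitted search object also exposes `best_params_` and the full `cv_results_` table. The sketch below reruns a smaller grid (an assumption made to keep it self-contained and fast) and ranks the candidates.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Reduced grid for illustration: 3 x 2 = 6 candidates.
param_grid = {"n_neighbors": [3, 5, 7],
              "weights": ["uniform", "distance"]}

gs = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
gs.fit(X, y)

# Hyperparameters of the best-ranked candidate.
print(gs.best_params_)

# One row per candidate, sorted so the best-ranked candidates come first.
results = pd.DataFrame(gs.cv_results_)
cols = ["param_n_neighbors", "param_weights", "mean_test_score", "rank_test_score"]
top = results[cols].sort_values("rank_test_score").head()
print(top)
```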

In [None]:
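import numpy as np
import matplotlib.pyplot as plt

def make_meshgrid(x, y, h=.02):
  # Build a grid that covers the data range with a margin of 1 on each side.
  x_min, x_max = x.min()-1, x.max()+1
  y_min, y_max = y.min()-1, y.max()+1
  xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                       np.arange(y_min, y_max, h))
  return xx, yy

def plot_contours(clf, xx, yy, **params):
  # Predict a class for every grid point and draw the regions as filled contours.
  Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
  Z = Z.reshape(xx.shape)
  out = plt.contourf(xx, yy, Z, **params)
  return out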

In [None]:
# Reduce the data to 2 dimensions, then check what KNeighborsClassifier learns.
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2)
x_comp = tsne.fit_transform(X)

In [None]:
iris_comp_df = pd.DataFrame(data=x_comp)
iris_comp_df['Target'] = y
iris_comp_df

In [None]:
plt.scatter(x_comp[:, 0], x_comp[:, 1],
            c=y, 
            cmap = plt.cm.coolwarm,
            s=20, edgecolors='k')

In [None]:
# Train on the 2-D data
model = KNeighborsClassifier()
model.fit(x_comp, y)
predict = model.predict(x_comp)

In [None]:
xx, yy = make_meshgrid(x_comp[:, 0], x_comp[:, 1])
plot_contours(model, xx, yy, cmap=plt.cm.coolwarm, alpha=.8)
plt.scatter(x_comp[:, 0], x_comp[:, 1], c=y, cmap=plt.cm.coolwarm, s=20, edgecolors='k');

---

## Exercise : breast cancer data

In [26]:
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()

In [27]:
cancer_df = pd.DataFrame(data=cancer.data, columns=cancer.feature_names)
cancer_df['Target'] = cancer.target
cancer_df

|   | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.30010 | 0.14710 | 0.2419 | 0.07871 | ... | 17.33 | 184.60 | 2019.0 | 0.16220 | 0.66560 | 0.7119 | 0.2654 | 0.4601 | 0.11890 | 0 |
| 1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.08690 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.80 | 1956.0 | 0.12380 | 0.18660 | 0.2416 | 0.1860 | 0.2750 | 0.08902 | 0 |
| 2 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.19740 | 0.12790 | 0.2069 | 0.05999 | ... | 25.53 | 152.50 | 1709.0 | 0.14440 | 0.42450 | 0.4504 | 0.2430 | 0.3613 | 0.08758 | 0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.24140 | 0.10520 | 0.2597 | 0.09744 | ... | 26.50 | 98.87 | 567.7 | 0.20980 | 0.86630 | 0.6869 | 0.2575 | 0.6638 | 0.17300 | 0 |
| 4 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.19800 | 0.10430 | 0.1809 | 0.05883 | ... | 16.67 | 152.20 | 1575.0 | 0.13740 | 0.20500 | 0.4000 | 0.1625 | 0.2364 | 0.07678 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 564 | 21.56 | 22.39 | 142.00 | 1479.0 | 0.11100 | 0.11590 | 0.24390 | 0.13890 | 0.1726 | 0.05623 | ... | 26.40 | 166.10 | 2027.0 | 0.14100 | 0.21130 | 0.4107 | 0.2216 | 0.2060 | 0.07115 | 0 |
| 565 | 20.13 | 28.25 | 131.20 | 1261.0 | 0.09780 | 0.10340 | 0.14400 | 0.09791 | 0.1752 | 0.05533 | ... | 38.25 | 155.00 | 1731.0 | 0.11660 | 0.19220 | 0.3215 | 0.1628 | 0.2572 | 0.06637 | 0 |
| 566 | 16.60 | 28.08 | 108.30 | 858.1 | 0.08455 | 0.10230 | 0.09251 | 0.05302 | 0.1590 | 0.05648 | ... | 34.12 | 126.70 | 1124.0 | 0.11390 | 0.30940 | 0.3403 | 0.1418 | 0.2218 | 0.07820 | 0 |
| 567 | 20.60 | 29.33 | 140.10 | 1265.0 | 0.11780 | 0.27700 | 0.35140 | 0.15200 | 0.2397 | 0.07016 | ... | 39.42 | 184.60 | 1821.0 | 0.16500 | 0.86810 | 0.9387 | 0.2650 | 0.4087 | 0.12400 | 0 |


In [None]:
X, y = cancer.data, cancer.target
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
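
A possible next step, mirroring the Iris exercise, is to standardize the features and fit a default `KNeighborsClassifier`. This is only a sketch of one way to continue: the pipeline, `random_state=0`, and `stratify=y` are assumptions added for reproducibility and do not come from the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# random_state and stratify are assumptions, added so the split is reproducible
# and keeps the class proportions in both parts.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# The features span very different magnitudes (e.g. area vs. smoothness),
# so the scaler is fit on the training data before K-NN is applied.
model = make_pipeline(StandardScaler(), KNeighborsClassifier())
model.fit(x_train, y_train)

train_score = model.score(x_train, y_train)
test_score = model.score(x_test, y_test)
print(f"Train Data Score: {train_score}")
print(f"Test Data Score: {test_score}")
```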