# K-Nearest Neighbors (KNN)
### - Classification/regression based on the nearest data points, without fitting an explicit predictive model
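
The idea can be illustrated with a minimal from-scratch sketch. It assumes Euclidean distance and a simple majority vote; the function name `knn_predict` and the toy data are hypothetical and not part of scikit-learn or of this notebook.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Predict the label of x_query by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the labels of those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated clusters in 2D
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_toy, y_toy, np.array([0.3, 0.3]), k=3)
pred_b = knn_predict(X_toy, y_toy, np.array([4.8, 5.0]), k=3)
print(pred_a, pred_b)
```

No parameters are learned here: "training" only stores the data, and all the work happens at prediction time. `KNeighborsClassifier` follows the same principle, with faster neighbor search structures.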

In [2]:
import pandas as pd
import numpy as np
import multiprocessing
import matplotlib.pyplot as plt
plt.style.use(['seaborn-whitegrid'])  # named 'seaborn-v0_8-whitegrid' in Matplotlib 3.6 and later

In [10]:
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.manifold import TSNE
from sklearn.datasets import load_iris, load_breast_cancer
# Note: load_boston was removed in scikit-learn 1.2, so this import requires an older version.
from sklearn.datasets import load_boston, fetch_california_housing
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline, Pipeline

## KNN Classification

### Loading the data

In [20]:
iris = load_iris()
iris

{'data': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5. , 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5.4, 3.7, 1.5, 0.2],
        [4.8, 3.4, 1.6, 0.2],
        [4.8, 3. , 1.4, 0.1],
        [4.3, 3. , 1.1, 0.1],
        [5.8, 4. , 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4],
        [5.4, 3.9, 1.3, 0.4],
        [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3],
        [5.1, 3.8, 1.5, 0.3],
        [5.4, 3.4, 1.7, 0.2],
        [5.1, 3.7, 1.5, 0.4],
        [4.6, 3.6, 1. , 0.2],
        [5.1, 3.3, 1.7, 0.5],
        [4.8, 3.4, 1.9, 0.2],
        [5. , 3. , 1.6, 0.2],
        [5. , 3.4, 1.6, 0.4],
        [5.2, 3.5, 1.5, 0.2],
        [5.2, 3.4, 1.4, 0.2],
        [4.7, 3.2, 1.6, 0.2],
        [4.8, 3.1, 1.6, 0.2],
        [5.4, 3.4, 1.5, 0.4],
        [5.2, 4.1, 1.5, 0.1],
  

In [17]:
iris_df = pd.DataFrame(data=iris.data, columns = iris.feature_names)
iris_df.head()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [18]:
iris_df['Target'] = iris.target
iris_df.head()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | Target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


### Splitting into train and test data

In [27]:
X, y = load_iris(return_X_y=True)

In [28]:
# X

In [29]:
# y

In [30]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [35]:
print("Total count : {0}, X_train count : {1}, y_train count : {2}, X_test count : {3}".format(len(X), len(X_train), len(y_train), len(X_test)))

Total count : 150, X_train count : 120, y_train count : 120, X_test count : 30
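
Because `train_test_split` shuffles randomly, the scores later in this notebook can change from run to run. A hedged sketch of a reproducible, class-balanced split follows; the `random_state=42` value and the `_tr`/`_te` variable names are arbitrary choices for illustration, not settings used by the notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# random_state fixes the shuffle; stratify=y keeps class proportions equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Number of test samples per class (iris has 50 samples of each of 3 classes)
counts = np.bincount(y_te)
print(len(X_tr), len(X_te), counts)
```

With stratification, each of the three iris classes contributes the same number of samples to the 30-sample test set.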


### Preprocessing (standardization)

In [36]:
scaler = StandardScaler()

In [44]:
X_train_scale = scaler.fit_transform(X_train)
X_test_scale = scaler.transform(X_test)
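
The scaler is fitted on the training data only, and the test data is then transformed with the training mean and standard deviation. The sketch below checks this against a manual computation on synthetic data; the data and the names `X_tr`, `X_te`, `sc` are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a train/test split
rng = np.random.default_rng(0)
X_tr = rng.normal(loc=10.0, scale=3.0, size=(100, 2))
X_te = rng.normal(loc=10.0, scale=3.0, size=(20, 2))

# Learn mean and standard deviation from the training data only
sc = StandardScaler().fit(X_tr)

# Manual standardization of the test data using the training statistics
# (StandardScaler uses the population standard deviation, ddof=0)
manual = (X_te - X_tr.mean(axis=0)) / X_tr.std(axis=0)

same = np.allclose(sc.transform(X_te), manual)
print(same)
```

Reusing the training statistics keeps information about the test set out of the preprocessing step.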

### Applying the model

In [45]:
# Baseline: fit and score on the unscaled features
model = KNeighborsClassifier()
model.fit(X_train, y_train)
print("Training data score : {:.6f}".format(model.score(X_train, y_train)))
print("Test data score : {:.6f}".format(model.score(X_test, y_test)))

Training data score : 0.966667
Test data score : 1.000000


In [47]:
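model = KNeighborsClassifier()
# Fit on the scaled training data so that fitting and scoring use the same feature scale.
# Fitting on the unscaled X_train while scoring on scaled inputs drops the score to 0.333333 on both sets.
model.fit(X_train_scale, y_train)
print("Training data score : {:.6f}".format(model.score(X_train_scale, y_train)))
print("Test data score : {:.6f}".format(model.score(X_test_scale, y_test)))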
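
Scaling mismatches between fitting and scoring are easy to introduce when the scaled arrays are handled by hand. One possible safeguard, sketched here, is to combine the scaler and the classifier with `make_pipeline`, so the same transformation is applied in `fit` and `score`. The `random_state=0` and `stratify=y` arguments and the name `pipe` are assumptions for this example, not the notebook's settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# The pipeline fits the scaler on the training data and reuses it at scoring time
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
pipe.fit(X_train, y_train)

train_acc = pipe.score(X_train, y_train)
test_acc = pipe.score(X_test, y_test)
print("Training data score : {:.6f}".format(train_acc))
print("Test data score : {:.6f}".format(test_acc))
```

The pipeline receives the raw features in both calls, so there is no separate scaled array to pass by mistake.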


### Cross-validation

In [68]:
cross_validate(
    estimator=KNeighborsClassifier(),
    X=X, y=y,
    cv=10,
    n_jobs=multiprocessing.cpu_count(),
    verbose=True)

[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done   6 out of  10 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=8)]: Done  10 out of  10 | elapsed:    0.0s finished


{'fit_time': array([0.        , 0.        , 0.00587654, 0.002177  , 0.00260663,
        0.        , 0.        , 0.        , 0.00751305, 0.        ]),
 'score_time': array([0.00628829, 0.00847125, 0.00359607, 0.00484109, 0.00542307,
        0.01064873, 0.00808191, 0.03180218, 0.01347828, 0.01824856]),
 'test_score': array([1.        , 0.93333333, 1.        , 1.        , 0.86666667,
        0.93333333, 0.93333333, 1.        , 1.        , 1.        ])}
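
The per-fold scores are easier to compare across models when reduced to a mean and a standard deviation. A small sketch using the ten `test_score` values recorded above; the variable names are illustrative.

```python
import numpy as np

# The ten test_score values from the cross_validate output above
test_scores = np.array([1.0, 0.93333333, 1.0, 1.0, 0.86666667,
                        0.93333333, 0.93333333, 1.0, 1.0, 1.0])

mean_score = test_scores.mean()
std_score = test_scores.std()
print("mean accuracy: {:.4f}, std: {:.4f}".format(mean_score, std_score))
```

For a live result, the same reduction can be applied to the `'test_score'` entry of the dictionary returned by `cross_validate`.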

### Searching for optimal hyperparameters

In [57]:
param_grid = [{'n_neighbors':[3, 5, 7],
              'weights':['uniform', 'distance'],
              'algorithm':['ball_tree', 'kd_tree', 'brute']}]

In [58]:
gs = GridSearchCV(
    estimator = KNeighborsClassifier(),
    param_grid = param_grid,
    n_jobs = multiprocessing.cpu_count(),
    verbose=True)

In [63]:
gs.fit(X, y)

Fitting 5 folds for each of 18 candidates, totalling 90 fits


[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done  52 tasks      | elapsed:    0.4s
[Parallel(n_jobs=8)]: Done  90 out of  90 | elapsed:    0.4s finished


GridSearchCV(estimator=KNeighborsClassifier(), n_jobs=8,
             param_grid=[{'algorithm': ['ball_tree', 'kd_tree', 'brute'],
                          'n_neighbors': [3, 5, 7],
                          'weights': ['uniform', 'distance']}],
             verbose=True)

In [64]:
gs.best_estimator_

KNeighborsClassifier(algorithm='ball_tree', n_neighbors=7)

In [66]:
print("GridSearchCV best score : {:.6f}".format(gs.best_score_))

GridSearchCV best score : 0.980000
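
Beyond the single best estimator, the full search results can be inspected to see how close the other candidates came. The following sketch reruns the same grid and ranks the candidates using `cv_results_`; the explicit `cv=5` and the names `results` and `top` are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

param_grid = {'n_neighbors': [3, 5, 7],
              'weights': ['uniform', 'distance'],
              'algorithm': ['ball_tree', 'kd_tree', 'brute']}

gs = GridSearchCV(estimator=KNeighborsClassifier(), param_grid=param_grid, cv=5)
gs.fit(X, y)

# One row per parameter combination, sorted by rank (1 = best mean test score)
results = pd.DataFrame(gs.cv_results_)
top = (results.sort_values("rank_test_score")
              [["params", "mean_test_score", "std_test_score"]]
              .head())

print(gs.best_params_)
print(top)
```

Several candidates may tie on mean score, in which case `best_estimator_` is just the first of them in the ranking.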


### Visualization