KNN

[![](https://miro.medium.com/max/506/0*QyWp7J6eSz0tayc0.png)](https://velog.io/@khsfun0312/KNN)

<br>

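* The KNN algorithm
 - One of the simplest supervised machine learning algorithms
 - Storing the training data is the entire model-building step
 - When a new data point arrives, the labels of the K training points nearest to it are checked, and the point is assigned the label that appears most often among them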
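
The procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the scikit-learn implementation used later; the function name `knn_predict` and the toy points are assumptions made up for this example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # "Training" is just keeping the data; all the work happens at prediction time.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])          # nearest first
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy data: two small clusters labelled "A" and "B".
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.0), (6.5, 5.5)]
train_labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(train_points, train_labels, (1.2, 1.4), k=3))  # A
print(knn_predict(train_points, train_labels, (6.2, 6.1), k=3))  # B
```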

<br>

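* Choosing K
 - Choosing K is a critical decision in KNN
 - If K is too small, the model reacts to noise such as outliers and overfits
 - If K is too large, the model smooths over the local patterns in the data and predictive performance drops
 - Use validation data to find the K that best suits the given training data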
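
One way to pick K with a validation set is sketched below in plain Python. The synthetic two-class data, the 70/30 hold-out split, and the candidate list `candidate_ks` are all assumptions for illustration; in practice cross-validation is a common alternative to a single hold-out split.

```python
import math
import random
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Majority vote among the k training points closest to `query`."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def validation_accuracy(train_X, train_y, val_X, val_y, k):
    """Share of validation points whose predicted label equals the true label."""
    hits = sum(knn_predict(train_X, train_y, x, k) == y for x, y in zip(val_X, val_y))
    return hits / len(val_y)

# Hypothetical noisy data: class 0 scattered around (0, 0), class 1 around (2, 2).
rng = random.Random(0)
X = [(rng.gauss(c * 2, 1.0), rng.gauss(c * 2, 1.0)) for c in (0, 1) for _ in range(60)]
y = [c for c in (0, 1) for _ in range(60)]

# Shuffle the row indices, then hold out 30% of the rows as validation data.
order = list(range(len(X)))
rng.shuffle(order)
cut = int(len(order) * 0.7)
train_idx, val_idx = order[:cut], order[cut:]
train_X, train_y = [X[i] for i in train_idx], [y[i] for i in train_idx]
val_X, val_y = [X[i] for i in val_idx], [y[i] for i in val_idx]

# Odd values of K avoid tied votes in a two-class problem.
candidate_ks = [1, 3, 5, 7, 9, 11, 15, 21]
scores = {k: validation_accuracy(train_X, train_y, val_X, val_y, k) for k in candidate_ks}
best_k = max(scores, key=scores.get)

for k, acc in scores.items():
    print(f"k={k:2d}  validation accuracy={acc:.3f}")
print("best k:", best_k)
```

The selected K should then be confirmed on data that played no part in the selection, since the validation score of the winning K is optimistically biased.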

<br>

* Measuring distance

<br>

[![](https://user-images.githubusercontent.com/3907357/103730278-d60b2a00-5025-11eb-8f8e-c26719ec3bc2.png)](https://github.com/jangsoohoon/recommend_system/wiki/%EB%B9%84%EC%8A%B7%ED%95%9C-%EC%BB%A8%ED%85%90%EC%B8%A0-%EC%B0%BE%EB%8A%94-%EB%B0%A9%EB%B2%95)

[![](https://velog.velcdn.com/images%2Fsloools%2Fpost%2Fb2407e65-9b11-450b-ba08-56a72628d8b0%2Fimage.png)](https://velog.io/@sloools/%ED%94%84%EB%A1%9C%EA%B7%B8%EB%9E%98%EB%A8%B8%EC%8A%A4-%EC%B9%B4%EC%B9%B4%EC%98%A4-%ED%82%A4%ED%8C%A8%EB%93%9C-%EB%88%84%EB%A5%B4%EA%B8%B0Python3-%EC%A2%8C%ED%91%9C-%EA%B1%B0%EB%A6%AC-%EA%B5%AC%ED%95%98%EA%B8%B0)

<br>

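* Measuring distance: feature scaling
 - When features differ in scale, the features with the larger scales can end up determining the distance.
 - It is therefore appropriate to bring the features to similar scales with standardization (z-score) or min-max scaling before measuring distances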
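
The effect of scale on distance can be seen in a small sketch. The four passengers below are made-up values (age in years, fare in a currency with a large numeric range), and the helper names are assumptions; `statistics.pstdev` is used so that the z-score divides by the population standard deviation.

```python
import math
import statistics

def z_score(column):
    """Standardize a column to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(v - mean) / std for v in column]

def min_max(column):
    """Rescale a column linearly to the range [0, 1]."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

def nearest_to_first(rows):
    """Index of the row closest to rows[0], ignoring rows[0] itself."""
    return min(range(1, len(rows)), key=lambda i: math.dist(rows[0], rows[i]))

# Hypothetical data: the fare spans tens of thousands, the age only tens.
ages = [22.0, 24.0, 60.0, 40.0]
fares = [7000.0, 9000.0, 7500.0, 30000.0]

raw = list(zip(ages, fares))
standardized = list(zip(z_score(ages), z_score(fares)))
rescaled = list(zip(min_max(ages), min_max(fares)))

for name, rows in [("raw", raw), ("z-score", standardized), ("min-max", rescaled)]:
    print(f"{name:8s} nearest neighbor of passenger 0: passenger {nearest_to_first(rows)}")
```

On the raw values the fare dominates, so the 60-year-old (passenger 2) looks closest to the 22-year-old; after either scaling, the 24-year-old (passenger 1) is the nearest neighbor.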

## 1. Data preparation / basic setup

In [1]:
# Import the basic libraries
import pandas as pd
import seaborn as sns

# Load the titanic dataset as a DataFrame with load_dataset
df = sns.load_dataset('titanic')

In [2]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


## 2. Data exploration

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


`category` is a categorical data type.  
It is not a built-in Python type; it is a dtype provided by pandas.  

In [6]:
# Drop the deck column, which has many NaN values, and the embark_town column, which duplicates embarked
rdf = df.drop(['deck', 'embark_town'], axis=1)
rdf.columns.values

array(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'alive', 'alone'],
      dtype=object)

In [7]:
# Drop every row with a missing age (177 of the 891 values in the age column are NaN)
rdf = rdf.dropna(subset=['age'], how='any', axis=0)
rdf.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 714 entries, 0 to 890
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   survived    714 non-null    int64   
 1   pclass      714 non-null    int64   
 2   sex         714 non-null    object  
 3   age         714 non-null    float64 
 4   sibsp       714 non-null    int64   
 5   parch       714 non-null    int64   
 6   fare        714 non-null    float64 
 7   embarked    712 non-null    object  
 8   class       714 non-null    category
 9   who         714 non-null    object  
 10  adult_male  714 non-null    bool    
 11  alive       714 non-null    object  
 12  alone       714 non-null    bool    
dtypes: bool(2), category(1), float64(2), int64(4), object(4)
memory usage: 63.6+ KB


In [8]:
rdf['embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [9]:
rdf['embarked'].value_counts()

S    554
C    130
Q     28
Name: embarked, dtype: int64

In [10]:
# Replace the NaN values in the embarked column with the most frequent port of embarkation
most_freq = rdf['embarked'].value_counts().idxmax()
most_freq

'S'

In [11]:
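# Assign the result back; calling fillna(..., inplace=True) on a selected column may not update rdf in recent pandas versions
rdf['embarked'] = rdf['embarked'].fillna(most_freq)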
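
What the two cells above do (find the most frequent value, then substitute it for the missing entries) can be sketched without pandas. The helper `fill_with_mode` and the sample list are hypothetical, and `None` stands in for NaN.

```python
from collections import Counter

def fill_with_mode(values, missing=None):
    """Replace missing entries with the most frequent observed value."""
    observed = [v for v in values if v is not missing]
    most_frequent = Counter(observed).most_common(1)[0][0]
    filled = [most_frequent if v is missing else v for v in values]
    return filled, most_frequent

# Hypothetical embarkation ports with two missing entries.
ports = ["S", "C", None, "S", "Q", "S", None, "C"]
filled, mode = fill_with_mode(ports)

print(mode)    # S
print(filled)  # ['S', 'C', 'S', 'S', 'Q', 'S', 'S', 'C']
```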

In [12]:
df.describe(include='all')    # include='all' also shows columns that cannot be aggregated numerically, such as strings

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
count,891.0,891.0,891,714.0,891.0,891.0,891.0,889,891,891,891,203,889,891,891
unique,,,2,,,,,3,3,3,2,7,3,2,2
top,,,male,,,,,S,Third,man,True,C,Southampton,no,True
freq,,,577,,,,,644,491,537,537,59,644,549,537
mean,0.383838,2.308642,,29.699118,0.523008,0.381594,32.204208,,,,,,,,
std,0.486592,0.836071,,14.526497,1.102743,0.806057,49.693429,,,,,,,,
min,0.0,1.0,,0.42,0.0,0.0,0.0,,,,,,,,
25%,0.0,2.0,,20.125,0.0,0.0,7.9104,,,,,,,,
50%,0.0,3.0,,28.0,0.0,0.0,14.4542,,,,,,,,
75%,1.0,3.0,,38.0,1.0,0.0,31.0,,,,,,,,


## 3. Selecting the features for the analysis

In [13]:
# Select the columns (features) to use in the analysis; .copy() avoids modifying a view of rdf later
ndf = rdf[['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'embarked']].copy()
ndf.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,embarked
0,0,3,male,22.0,1,0,S
1,1,1,female,38.0,1,0,C
2,1,3,female,26.0,0,0,S
3,1,1,female,35.0,1,0,S
4,0,3,male,35.0,0,0,S


In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


In [15]:
# One-hot encoding: convert categorical data into a numeric form the model can use
onehot_sex = pd.get_dummies(ndf['sex'])
onehot_sex.head()

Unnamed: 0,female,male
0,0,1
1,1,0
2,1,0
3,1,0
4,0,1
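
As a rough picture of what one-hot encoding produces, the sketch below builds the same 0/1 columns by hand for the first five `sex` values shown earlier. The function `one_hot` is a hypothetical helper, not part of pandas. Note that recent pandas versions return boolean columns from `get_dummies` by default; passing `dtype=int` should give 0/1 integers as in the output above.

```python
def one_hot(values, prefix=None):
    """Return (column names, rows of 0/1 flags), one column per distinct category."""
    categories = sorted(set(values))
    names = [f"{prefix}_{c}" if prefix else str(c) for c in categories]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return names, rows

names, rows = one_hot(["male", "female", "female", "female", "male"])
print(names)          # ['female', 'male']
for row in rows:
    print(row)        # [0, 1], [1, 0], [1, 0], [1, 0], [0, 1]

# With a prefix, as in the embarked column handled next in the notebook.
town_names, town_rows = one_hot(["S", "C", "S"], prefix="town")
print(town_names)     # ['town_C', 'town_S']
```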


In [16]:
ndf = pd.concat([ndf, onehot_sex], axis=1)
ndf.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,embarked,female,male
0,0,3,male,22.0,1,0,S,0,1
1,1,1,female,38.0,1,0,C,1,0
2,1,3,female,26.0,0,0,S,1,0
3,1,1,female,35.0,1,0,S,1,0
4,0,3,male,35.0,0,0,S,0,1


In [17]:
onehot_embarked = pd.get_dummies(ndf['embarked'], prefix='town')
onehot_embarked.head()

Unnamed: 0,town_C,town_Q,town_S
0,0,0,1
1,1,0,0
2,0,0,1
3,0,0,1
4,0,0,1


In [18]:
ndf = pd.concat([ndf, onehot_embarked], axis=1)
ndf.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,embarked,female,male,town_C,town_Q,town_S
0,0,3,male,22.0,1,0,S,0,1,0,0,1
1,1,1,female,38.0,1,0,C,1,0,1,0,0
2,1,3,female,26.0,0,0,S,1,0,0,0,1
3,1,1,female,35.0,1,0,S,1,0,0,0,1
4,0,3,male,35.0,0,0,S,0,1,0,0,1


In [19]:
ndf.drop(['sex', 'embarked'], axis=1, inplace=True)
ndf.head()

Unnamed: 0,survived,pclass,age,sibsp,parch,female,male,town_C,town_Q,town_S
0,0,3,22.0,1,0,0,1,0,0,1
1,1,1,38.0,1,0,1,0,1,0,0
2,1,3,26.0,0,0,1,0,0,0,1
3,1,1,35.0,1,0,1,0,0,0,1
4,0,3,35.0,0,0,0,1,0,0,1


## 4. Splitting the dataset: training (train data) / evaluation (test data)

In [20]:
# Select the features (variables)
X = ndf[list(ndf.columns)[1:]]    # independent variables X
y = ndf[list(ndf.columns)[0]]     # dependent variable y (survived)

In [21]:
X.shape

(714, 9)

In [22]:
y.shape

(714,)

In [23]:
X.head()

Unnamed: 0,pclass,age,sibsp,parch,female,male,town_C,town_Q,town_S
0,3,22.0,1,0,0,1,0,0,1
1,1,38.0,1,0,1,0,1,0,0
2,3,26.0,0,0,1,0,0,0,1
3,1,35.0,1,0,1,0,0,0,1
4,3,35.0,0,0,0,1,0,0,1


In [24]:
# Feature standardization (z-score): rescale each column to mean 0 and standard deviation 1
from sklearn.preprocessing import StandardScaler
X = StandardScaler().fit(X).transform(X)

In [25]:
# Column 4 is `female`; its 0/1 values are now z-scores
X[:, 4][:5]

array([-0.75905134,  1.31743394,  1.31743394,  1.31743394, -0.75905134])
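
The notebook standardizes the whole of `X` before splitting it. A commonly recommended variant, sketched here in plain Python under the assumption that the split has already been made, computes the mean and standard deviation on the training rows only and reuses them for the test rows, so that no information from the test data leaks into preprocessing. The helper names and the tiny `train_rows`/`test_rows` lists are hypothetical.

```python
import statistics

def fit_standardizer(rows):
    """Compute per-column mean and population standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds

def apply_standardizer(rows, means, stds):
    """Standardize rows with previously fitted statistics."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Hypothetical (pclass, age) rows.
train_rows = [[3, 22.0], [1, 38.0], [3, 26.0], [1, 35.0]]
test_rows = [[3, 35.0], [2, 54.0]]

means, stds = fit_standardizer(train_rows)                   # statistics from train only
train_scaled = apply_standardizer(train_rows, means, stds)
test_scaled = apply_standardizer(test_rows, means, stds)     # reuse the train statistics

print(means, stds)
print(test_scaled)
```

With scikit-learn the same idea corresponds to calling `fit` on `X_train` and then `transform` on both `X_train` and `X_test`.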

In [26]:
# Split into train data and test data (7:3 ratio)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.3,
                                                    random_state=10)

In [28]:
print('train data shape:', X_train.shape)
print('test data shape: ', X_test.shape)

train data shape: (499, 9)
test data shape:  (215, 9)
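
The 499/215 split of 714 rows follows from rounding the test share up: 714 × 0.3 = 214.2, which becomes 215 test rows. The sketch below shows a shuffled split in plain Python; `split_train_test` and the demo data are hypothetical, and the particular rows selected will not match those chosen by `train_test_split`, because the shuffling algorithm differs even with the same seed.

```python
import math
import random

def split_train_test(X, y, test_size=0.3, seed=10):
    """Shuffle the row indices, then use the first ceil(n * test_size) of them as test data."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(X) * test_size)       # round the test share up
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Hypothetical stand-in data with the same number of rows as the notebook (714).
X_demo = [[i, i % 3] for i in range(714)]
y_demo = [i % 2 for i in range(714)]

X_train_demo, X_test_demo, y_train_demo, y_test_demo = split_train_test(X_demo, y_demo)
print(len(X_train_demo), len(X_test_demo))  # 499 215
```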


In [30]:
print('y_train data shape:', y_train.shape)
print('y_test data shape: ', y_test.shape)

y_train data shape: (499,)
y_test data shape:  (215,)


## 5. KNN classification model with sklearn

In [31]:
# Import the KNN classifier from sklearn
from sklearn.neighbors import KNeighborsClassifier

# Create the model object (k=5)
knn = KNeighborsClassifier(n_neighbors=5)

# Fit the model on the train data
knn.fit(X_train, y_train)

KNeighborsClassifier()

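* n_neighbors (default=5): number of neighboring samples considered when classifying
* weights (default='uniform'): when set to 'distance', neighbors are weighted by their distance when classifying (the closer the neighbor, the larger its weight)
* metric (default='minkowski'): distance measure (minkowski, euclidean, mahalanobis, etc.); with the default p=2, minkowski is the Euclidean distance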
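
The `metric` and `weights` options can be illustrated with a short sketch. `minkowski` and `weighted_vote` are hypothetical helpers written for this example; the 1/distance weighting mirrors the idea behind `weights='distance'`, but the handling of zero distances here is a simplification and not necessarily what scikit-learn does internally.

```python
from collections import defaultdict

def minkowski(a, b, p=2):
    """Minkowski distance: p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def weighted_vote(neighbors):
    """neighbors: list of (distance, label). Each neighbor votes with weight 1/distance."""
    totals = defaultdict(float)
    for distance, label in neighbors:
        # Simplification: an exact match (distance 0) gets infinite weight.
        totals[label] += 1.0 / distance if distance > 0 else float("inf")
    return max(totals, key=totals.get)

print(minkowski((0, 0), (3, 4), p=1))  # 7.0 (Manhattan)
print(minkowski((0, 0), (3, 4), p=2))  # 5.0 (Euclidean)

# One close neighbor of class 1 and two farther neighbors of class 0:
# a uniform vote would pick 0, the distance-weighted vote picks 1.
neighbors = [(0.5, 1), (2.0, 0), (2.5, 0)]
print(weighted_vote(neighbors))        # 1
```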

In [32]:
# Predict (classify) y_hat for the test data
y_hat = knn.predict(X_test)

In [33]:
y_hat.shape

(215,)

In [34]:
y_test.shape

(215,)

In [35]:
import numpy as np
y_test = np.array(y_test)
y_test

array([0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0,
       0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1,
       1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1,
       1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0,
       1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0,
       1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1,
       1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0])

In [39]:
import pandas as pd
df = pd.DataFrame({'y_hat': y_hat, 'y': y_test})

In [40]:
df.head()

Unnamed: 0,y_hat,y
0,0,0
1,0,0
2,1,1
3,0,0
4,0,0


In [41]:
# The '차이' ("difference") column is True where the prediction matches the actual label
df['차이'] = df['y_hat'] == df['y']
df.head()

Unnamed: 0,y_hat,y,차이
0,0,0,True
1,0,0,True
2,1,1,True
3,0,0,True
4,0,0,True


In [42]:
len(df)

215

In [43]:
sum(df['차이'])

177

In [45]:
print('accuracy: ', sum(df['차이']) / len(df))

accuracy:  0.8232558139534883


In [48]:
# Model evaluation: compute the confusion matrix
from sklearn import metrics
knn_matrix = metrics.confusion_matrix(y_test, y_hat)
knn_matrix

array([[111,  14],
       [ 24,  66]])

In [49]:
# Model evaluation: compute the evaluation metrics
knn_report = metrics.classification_report(y_test, y_hat)
print(knn_report)

              precision    recall  f1-score   support

           0       0.82      0.89      0.85       125
           1       0.82      0.73      0.78        90

    accuracy                           0.82       215
   macro avg       0.82      0.81      0.82       215
weighted avg       0.82      0.82      0.82       215
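
The figures in the report can be recomputed by hand from the confusion matrix printed earlier, whose rows are the actual classes and whose columns are the predicted classes. The variable names below are chosen for this sketch only.

```python
# Confusion matrix from the notebook: rows = actual 0/1, columns = predicted 0/1.
matrix = [[111, 14],
          [24, 66]]

tn, fp = matrix[0]   # actual 0: predicted 0 (true negative), predicted 1 (false positive)
fn, tp = matrix[1]   # actual 1: predicted 0 (false negative), predicted 1 (true positive)

accuracy = (tn + tp) / (tn + fp + fn + tp)

# Class 1 (survived)
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Class 0 (did not survive): the roles of the two classes are swapped.
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

print(f"accuracy  = {accuracy:.4f}")
print(f"class 0: precision={precision_0:.4f} recall={recall_0:.4f} f1={f1_0:.4f}")
print(f"class 1: precision={precision_1:.4f} recall={recall_1:.4f} f1={f1_1:.4f}")
```

The accuracy is 177/215, the same value obtained earlier by counting matching predictions.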



In [50]:
# Return the distances and indices of the k nearest samples; with no argument, the neighbors of each training sample are returned
knn.kneighbors()

(array([[2.90414419, 3.03676623, 3.26298986, 3.41669105, 3.54246655],
        [0.        , 0.06888798, 0.06888798, 0.06888798, 0.06888798],
        [0.        , 0.        , 0.        , 0.        , 0.        ],
        ...,
        [0.        , 0.06888798, 0.06888798, 0.06888798, 0.13777595],
        [0.96443168, 1.22229197, 1.22229197, 1.22229197, 1.24207642],
        [1.2434624 , 1.29579097, 1.51838311, 1.51838311, 1.51838311]]),
 array([[246, 142, 297,  89, 340],
        [ 49, 191,  25,  77, 258],
        [ 59, 339, 256,  13, 320],
        ...,
        [216, 274,  75,  54, 192],
        [486,   9, 172, 208, 360],
        [368, 277,  84, 426, 447]]))
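
When `kneighbors()` is called without arguments, each training sample is queried against the other training samples and is not counted as its own neighbor. The sketch below imitates that behavior in plain Python on four made-up points; the function name is hypothetical. The zero distances in the output above most likely come from other training rows with identical feature values.

```python
import math

def k_neighbors_of_training_points(points, k):
    """For each point, return distances and indices of its k nearest other points."""
    all_distances, all_indices = [], []
    for i, p in enumerate(points):
        # Skip j == i so that a point is never its own neighbor.
        others = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )[:k]
        all_distances.append([d for d, _ in others])
        all_indices.append([j for _, j in others])
    return all_distances, all_indices

# Hypothetical 2-D training points.
points = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0), (3.0, 4.0)]
distances, indices = k_neighbors_of_training_points(points, k=2)

print(indices)       # [[1, 2], [0, 2], [0, 1], [2, 1]]
print(distances[0])  # [1.0, 3.0]
```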