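# 1- The KNN Algorithm

Given a training dataset and a new input instance, find the K instances in the training set that are nearest to it; the input is assigned to the class that the majority of those K instances belong to, i.e. majority rule.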

In [8]:
import numpy as np

## 1.1- The most common distance metrics

In [2]:
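def get_dist(a_arr, b_arr, method=True):
    """Return the Euclidean distance if method is True, otherwise the Manhattan distance."""
    dis_arr = np.abs(a_arr - b_arr)
    if method:
        # Euclidean distance: square root of the sum of squared differences
        return np.sqrt(np.power(dis_arr, 2).sum())
    # Manhattan distance: sum of absolute differences (no square root)
    return dis_arr.sum()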

In [9]:
a_arr = np.array([1, 4])
b_arr = np.array([3, 4])

In [10]:
get_dist(a_arr, b_arr, method=True)

np.float64(2.0)
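
As a cross-check on the two metrics, the sketch below computes the Euclidean (L2) and Manhattan (L1) distances with `np.linalg.norm`. The points `p` and `q` and the variable names are illustrative assumptions, not part of the notebook; they are chosen so that the two metrics give different values.

```python
import numpy as np

# Hypothetical points chosen so that the two metrics differ
p = np.array([1, 4])
q = np.array([4, 8])

# Euclidean (L2) distance: square root of the summed squared differences
euclidean = np.linalg.norm(p - q, ord=2)

# Manhattan (L1) distance: sum of the absolute differences
manhattan = np.linalg.norm(p - q, ord=1)

print(euclidean, manhattan)  # 5.0 7.0
```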

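## 1.2- Description of the KNN algorithm

### 1.2.1- Algorithm steps

- Compute the distance between the test sample and every training sample;

- Sort the training samples by increasing distance;

- Select the K points with the smallest distances;

- Count how often each class occurs among those K points;

- Return the most frequent class among the K points as the predicted class of the test sample.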
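
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the notebook's own implementation: the function name `knn_predict`, its parameters, and the toy data are assumptions made for the example, and Euclidean distance is assumed as the metric.

```python
import numpy as np
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # Step 1: Euclidean distance from the query to every training sample
    dists = np.sqrt(((train_x - query) ** 2).sum(axis=1))
    # Steps 2-3: sort by increasing distance and keep the k closest indices
    nearest = np.argsort(dists)[:k]
    # Step 4: count how often each class appears among those k neighbors
    votes = Counter(train_y[nearest].tolist())
    # Step 5: return the most frequent class
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated clusters labelled 0 and 1
train_x = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
train_y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(train_x, train_y, np.array([1.1, 1.0]), k=3))  # 0
print(knn_predict(train_x, train_y, np.array([5.1, 5.0]), k=3))  # 1
```

Sorting all distances costs O(m log m) per query for m training samples, which is fine for small datasets such as the one used later in this notebook.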

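### 1.2.2- The importance of normalization
Suppose the samples are $\{(x_{i1}, x_{i2}, ..., x_{in})\}_{i=1}^{m}$, i.e. $m$ samples with $n$ features each. The range of feature $j$ is:
$$ M_{j} = \max_{i=1,2,...,m} x_{ij}  - \min_{i=1,2,...,m} x_{ij}$$

When computing distances, each feature is divided by its range $M_{j}$ so that features measured on larger scales do not dominate the result:

$$ d(Y, Z)  = \sqrt{\sum_{j=1}^{n}\left(\frac{y_j}{M_j} - \frac{z_j}{M_j}\right)^2} $$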

In [12]:
va = np.random.random((3, 2))
va.max(axis=0)

array([0.89127927, 0.41582134])
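
To illustrate why scaling by the feature range matters, the following sketch compares the raw and range-scaled Euclidean distance between two samples whose features live on very different scales. The data in `samples` and the helper `scaled_dist` are hypothetical and only serve the illustration.

```python
import numpy as np

# Hypothetical samples: column 0 is in the thousands, column 1 lies in [0, 1]
samples = np.array([[1000.0, 0.1],
                    [2000.0, 0.9],
                    [3000.0, 0.5]])

# Per-feature range M_j = max - min, computed column-wise
feature_range = samples.max(axis=0) - samples.min(axis=0)

def scaled_dist(y, z, m):
    """Euclidean distance after dividing each feature difference by its range m."""
    return np.sqrt((((y - z) / m) ** 2).sum())

y, z = samples[0], samples[1]
raw = np.sqrt(((y - z) ** 2).sum())         # dominated almost entirely by column 0
scaled = scaled_dist(y, z, feature_range)   # both columns now contribute
print(raw, scaled)
```

Without scaling, the second feature has almost no influence on which neighbors count as nearest; after scaling, both features contribute on a comparable footing.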

## 1.3- A worked example

#### First, import the required packages

In [1]:
import numpy as np
import pandas as pd
import os, sys
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

#### Then load and preprocess the data

In [2]:
data_df = pd.read_csv("iris.csv") 

In [3]:
data_df.head(5)

|   | sepal_length | sepal_width | petal_length | petal_width | class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | SETOSA |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | SETOSA |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | SETOSA |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | SETOSA |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | SETOSA |


In [4]:
data_df['class'].unique()

array(['SETOSA', 'VERSICOLOR', 'VIRGINICA'], dtype=object)

In [5]:
# Encode each class name as an integer label
label_arr = data_df['class'].unique()
label_dict = {value: ix for ix, value in enumerate(label_arr)}
data_df['y_label'] = data_df['class'].map(label_dict)

#### Split into training and test sets

In [6]:
x_data = data_df[[i for i in data_df.columns if i not in ['class', 'y_label']]]
y_data = data_df.y_label

x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=.2, random_state=42)

x_train.shape, x_test.shape

y_train.shape, y_test.shape

x_train

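|   | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| 22 | 4.6 | 3.6 | 1.0 | 0.2 |
| 15 | 5.7 | 4.4 | 1.5 | 0.4 |
| 65 | 6.7 | 3.1 | 4.4 | 1.4 |
| 11 | 4.8 | 3.4 | 1.6 | 0.2 |
| 42 | 4.4 | 3.2 | 1.3 | 0.2 |
| ... | ... | ... | ... | ... |
| 71 | 6.1 | 2.8 | 4.0 | 1.3 |
| 106 | 4.9 | 2.5 | 4.5 | 1.7 |
| 14 | 5.8 | 4.0 | 1.2 | 0.2 |
| 92 | 5.8 | 2.6 | 4.0 | 1.2 |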


#### Approach 1: a hand-written KNN

In [7]:
# Number of neighbors to use
k_value = 5

In [8]:
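pred_list = []
for ix, test in x_test.iterrows():
    # Squared Euclidean distance from this test sample to every training sample;
    # the square root is skipped because it does not change the ordering
    sq_dist = ((x_train - test) ** 2).sum(axis=1).sort_values()
    # Majority vote among the k nearest neighbors (value_counts sorts by frequency)
    pred_label = y_train.loc[sq_dist.iloc[:k_value].index].value_counts().index[0]
    pred_list.append(pred_label)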

In [9]:
y_test_list = y_test.values.tolist()

In [10]:
# Compare predicted labels with true labels
for a, b in zip(pred_list, y_test_list):
    print(a,b)

1 1
0 0
2 2
1 1
1 1
0 0
1 1
2 2
1 1
1 1
2 2
0 0
0 0
0 0
0 0
1 1
2 2
1 1
1 1
2 2
0 0
2 2
0 0
2 2
2 2
2 2
2 2
2 2
0 0
0 0


In [11]:
checkout_list = [1 if value == pred_list[ix] else 0 for ix, value in enumerate(y_test_list)]

In [12]:
sum(checkout_list) / len(checkout_list)

1.0

#### Approach 2: using the ready-made sklearn module

In [13]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_score, recall_score, f1_score

In [14]:
knn = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')

In [15]:
knn.fit(x_train, y_train)

In [16]:
pred_test = knn.predict(X=x_test)

In [17]:
knn.score(X=x_test, y=y_test)

1.0

In [18]:
recall_score(y_test, pred_test, average='micro')

1.0
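
For single-label multiclass problems, micro-averaged recall pools true positives and false negatives over all classes, which makes it equal to plain accuracy; that is why it matches the value returned by `knn.score` here. The sketch below checks this by hand with NumPy on hypothetical labels (`y_true` and `y_pred` are made up for the illustration and are not the notebook's predictions).

```python
import numpy as np

# Hypothetical true and predicted labels for a 3-class problem
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2, 1])

classes = np.unique(y_true)
# Per-class true positives and false negatives
tp = np.array([np.sum((y_true == c) & (y_pred == c)) for c in classes])
fn = np.array([np.sum((y_true == c) & (y_pred != c)) for c in classes])

# Micro averaging pools the counts over all classes before dividing
micro_recall = tp.sum() / (tp.sum() + fn.sum())
accuracy = np.mean(y_true == y_pred)
print(micro_recall, accuracy)  # 0.75 0.75
```

To see per-class differences instead, `average='macro'` or `average=None` in `recall_score` would be the options to look at.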