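The KNN algorithm:
KNN (K-Nearest Neighbors) is a supervised learning algorithm commonly used for classification and regression. The basic idea: for a new sample, compute its distance to every sample in the training set, select the K nearest neighbors, and predict the new sample's label from the labels of those K samples.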

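KNN is a lazy supervised learning algorithm: it does not fit a model on the training set. Instead, at prediction time it finds the k training samples closest to each test sample and inspects their labels. For classification it returns the most frequent label among those k samples; for regression it returns the mean of their target values.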
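
The procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the scikit-learn implementation used later in this notebook; the function name `knn_predict` and the toy data are hypothetical, and Euclidean distance is assumed.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3, task="classification"):
    """Predict the label of `query` from its k nearest training samples."""
    # Euclidean distance from the query to every training sample.
    distances = [
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    ]
    # Keep the k closest samples.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    labels = [label for _, label in nearest]
    if task == "classification":
        # Majority vote among the k labels.
        return Counter(labels).most_common(1)[0][0]
    # Regression: average the k target values.
    return sum(labels) / len(labels)

# Hypothetical toy data: two well-separated clusters in 2-D.
train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 8.0)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, (1.2, 1.4), k=3))  # near the first cluster
print(knn_predict(train_X, train_y, (6.8, 7.0), k=3))  # near the second cluster
```

Note that there is no training step: all of the work happens inside `knn_predict`, which is what "lazy" means here.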

1. Import libraries and the dataset

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

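# Load the dataset (path is relative to the notebook)
dataset = pd.read_csv('./data/Social_Network_Ads.csv')
# Features: columns 2 and 3 (Age, EstimatedSalary)
X = dataset.iloc[:, [2, 3]].values
# Target: column 4 (Purchased)
Y = dataset.iloc[:, 4].values
dataset.head()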

    User ID  Gender  Age  EstimatedSalary  Purchased
0  15624510    Male   19            19000          0
1  15810944    Male   35            20000          0
2  15668575  Female   26            43000          0
3  15603246  Female   27            57000          0
4  15804002    Male   19            76000          0


2. Split the data into a training set and a test set

In [4]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.25, random_state = 0) 
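
To make the effect of `test_size` and `random_state` concrete, here is a small sketch on placeholder data. The arrays `X_demo` and `y_demo` are made up for illustration; the 400-row size is an assumption chosen so that a 25% test split yields the 100 test samples that appear in the prediction output later in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the real features and labels.
X_demo = np.arange(800).reshape(400, 2)
y_demo = np.arange(400) % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)
print(X_tr.shape, X_te.shape)  # 75% of the rows go to training, 25% to testing

# Re-running with the same random_state reproduces the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)
print(np.array_equal(X_te, X_te2))
```

Fixing `random_state` is what makes the results in the rest of the notebook reproducible from run to run.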

3. Feature scaling

In [5]:
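## KNN is distance-based, so standardize the features of X to put them on a comparable scale
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test) ## scale the test set with the mean and std learned from the training set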
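
The following sketch shows what `fit_transform` and `transform` compute, using a handful of illustrative (Age, EstimatedSalary) rows rather than the full dataset. The variable names (`train`, `test`, `scaler`, `manual_test`) are hypothetical. `StandardScaler` subtracts the per-column mean and divides by the per-column population standard deviation, both estimated on the training rows only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative (Age, EstimatedSalary) rows; not the full dataset.
train = np.array([[19.0, 19000.0], [35.0, 20000.0], [26.0, 43000.0], [27.0, 57000.0]])
test = np.array([[19.0, 76000.0]])

scaler = StandardScaler().fit(train)

# Manual equivalent: statistics come from the training rows only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), matching StandardScaler
manual_test = (test - mean) / std

print(np.allclose(scaler.transform(test), manual_test))

# After scaling, each training column has mean 0 and std 1.
scaled_train = scaler.transform(train)
print(np.allclose(scaled_train.mean(axis=0), 0.0), np.allclose(scaled_train.std(axis=0), 1.0))
```

Without this step, salary differences in the tens of thousands would dominate age differences in the tens when distances are computed.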

4. Fit K-NN on the training set

In [None]:
from sklearn.neighbors import KNeighborsClassifier
## Create a KNeighborsClassifier with k=5; the metric is the Minkowski distance, which with p=2 is the Euclidean distance
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(X_train, y_train)
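
As a quick check of the claim that the Minkowski distance with p = 2 is the Euclidean distance, here is a small standalone sketch. The helper `minkowski` is written for illustration and is not part of scikit-learn's API.

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance: (sum_i |a_i - b_i|^p)^(1/p)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

a, b = [0.0, 0.0], [3.0, 4.0]

print(minkowski(a, b, p=2))  # Euclidean distance (the 3-4-5 triangle)
print(minkowski(a, b, p=1))  # Manhattan distance (sum of absolute differences)
```

Changing `p` in `KNeighborsClassifier` therefore changes which points count as "nearest"; p = 1 gives the Manhattan distance.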

5. Predict on the test set

In [8]:
y_pred = classifier.predict(X_test)
y_pred

array([0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1,
       0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1,
       0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1])

6. Build the confusion matrix

In [10]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm

array([[64,  4],
       [ 3, 29]])
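
In scikit-learn's convention, rows of the confusion matrix are true labels and columns are predicted labels, so the matrix above reads as 64 true negatives, 4 false positives, 3 false negatives, and 29 true positives. The sketch below shows how to unpack those four counts on a small made-up example; `y_true_demo` and `y_pred_demo` are hypothetical.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration.
y_true_demo = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_demo = [0, 1, 0, 1, 0, 1, 1, 0]

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
# For binary labels {0, 1}, ravel() yields the counts in this order.
tn, fp, fn, tp = cm_demo.ravel()

print(cm_demo)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```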

Supplement 7. Compute model evaluation metrics

In [11]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(accuracy, precision, recall, f1)

0.93 0.8787878787878788 0.90625 0.8923076923076924
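
These four numbers can be cross-checked by hand from the confusion matrix in step 6. The sketch below plugs in its counts (64, 4, 3, 29) and applies the standard definitions; the `_manual` variable names are introduced here only to avoid overwriting the values computed above.

```python
# Counts read from the confusion matrix [[64, 4], [3, 29]].
tn, fp, fn, tp = 64, 4, 3, 29

accuracy_manual = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision_manual = tp / (tp + fp)                  # share of predicted positives that are correct
recall_manual = tp / (tp + fn)                     # share of actual positives that are found
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print(accuracy_manual, precision_manual, recall_manual, f1_manual)
```

The results agree with the scikit-learn output: 93/100 for accuracy, 29/33 for precision, 29/32 for recall, and 58/65 for F1.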
