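### KNN classification model
- The k-nearest neighbors (KNN) algorithm classifies a sample by measuring the distance between its feature values and those of the training samples
- Three elements of the decision procedure
  - First, choose a distance metric;
  - Second, choose k (find the k training instances closest to the point to be predicted);
  - Third, choose a classification decision rule.
- The role of k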




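- In **classification** tasks, "voting" can be used: the class label that occurs most often among the k instances is the prediction;
- In **regression** tasks, "averaging" can be used: the mean of the real-valued outputs of the k instances is the prediction;
- The vote or the average can also be weighted by distance, with closer instances receiving larger weights
- Advantages: high accuracy, relatively insensitive to outliers, and no assumptions about the input data.
- Disadvantages: high time complexity and high space complexity, because the whole training set must be stored and searched at prediction time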
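
The voting, averaging, and distance-weighted rules above can be sketched in a few lines. This is a minimal illustration, not a full KNN implementation: the three neighbors are made-up values, and the `1 / distance` weighting is one common choice assumed here (it would need special handling for a distance of zero).

```python
from collections import Counter

# Hypothetical neighbors of one query point: (distance, class label, numeric target).
neighbors = [(0.5, "A", 10.0), (1.0, "B", 20.0), (2.0, "B", 30.0)]

# Classification by majority vote: every neighbor counts once.
votes = Counter(label for _, label, _ in neighbors)
majority_label = votes.most_common(1)[0][0]

# Regression by plain averaging of the neighbors' numeric targets.
mean_value = sum(value for _, _, value in neighbors) / len(neighbors)

# Distance-weighted vote: each neighbor contributes 1 / distance (assumed weighting).
weighted_votes = Counter()
for dist, label, _ in neighbors:
    weighted_votes[label] += 1.0 / dist
weighted_label = weighted_votes.most_common(1)[0][0]

print(majority_label, mean_value, weighted_label)  # B 20.0 A
```

Here the plain vote picks "B" (two votes against one), while the weighted vote picks "A" because the single "A" neighbor is much closer than either "B" neighbor.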

- Euclidean distance
  - $\mathrm{dist}(X,Y) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}$
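
As a quick check of the formula, here is a minimal sketch in plain Python. The function name `euclidean_distance` is introduced only for this illustration.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two equal-length sequences of numbers."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    # Sum the squared coordinate differences, then take the square root.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

d = euclidean_distance([0.0, 0.0], [3.0, 4.0])
print(d)  # 5.0
```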

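- Choosing k
- A common procedure:
  1. Start with k=1 and estimate the classifier's error rate on a validation set.
  2. Repeat, increasing k by 1 each time so that one more neighbor is admitted.
  3. Pick the k that yields the lowest error rate.

- Notes:
  1. As a rule of thumb, k rarely exceeds 20, and its upper bound is the square root of n (n being the number of training samples); as the dataset grows, k should grow too.
  2. In practice, a fairly small k is usually chosen, and cross-validation is used to select the best value.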
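
The procedure above can be sketched as follows. This is an illustrative example on the iris data: the 75/25 split, `random_state=0`, and the names `k_max`, `error_rates`, and `best_k` are assumptions made for the sketch, and the cap combines the two rules of thumb from the notes (at most 20, at most the square root of the training-set size).

```python
import math
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# Hold out a validation set that is used only for comparing values of k.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Upper bound on k: min(20, sqrt(n)) with n the number of training samples.
k_max = min(20, int(math.sqrt(len(X_train))))

error_rates = {}
for k in range(1, k_max + 1):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    error_rates[k] = 1.0 - model.score(X_val, y_val)  # validation error rate

# Smallest k among those with the lowest validation error.
best_k = min(error_rates, key=error_rates.get)
print(best_k, error_rates[best_k])
```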


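### How to classify movies
- Movies can be classified by genre, but how is a genre itself defined? Who decides which genre a movie belongs to? In other words, what features do movies of the same genre share? These questions have to be answered before movies can be classified. No filmmaker would say that their movie resembles an earlier one, yet we know that each movie can be close in style to others of the same genre. So what common features make action movies resemble one another and differ clearly from romance movies? Action movies may contain kissing scenes and romance movies may contain fight scenes, so the mere presence of a fight or a kiss cannot determine the genre. However, romance movies contain more kissing scenes and action movies contain more fight scenes, so the number of times such scenes occur in a movie can be used to classify it.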
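
To make the idea concrete, here is a toy sketch that classifies a movie from two scene counts. The counts and the helper `classify` are invented for illustration and do not describe real movies.

```python
import math
from collections import Counter

# Made-up training movies: (fight scenes, kiss scenes) -> genre.
training_movies = [
    ((3, 100), "romance"),
    ((2, 90), "romance"),
    ((1, 80), "romance"),
    ((100, 10), "action"),
    ((95, 5), "action"),
    ((90, 2), "action"),
]

def classify(query, training, k=3):
    """Return the majority genre among the k training movies nearest to query."""
    by_distance = sorted(training, key=lambda item: math.dist(query, item[0]))
    labels = [label for _, label in by_distance[:k]]
    return Counter(labels).most_common(1)[0][0]

# A movie with 18 fight scenes and 90 kiss scenes lands among the romance movies.
genre = classify((18, 90), training_movies)
print(genre)  # romance
```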

### Implementing iris classification

In [1]:
import numpy as np
import pandas as pd
import sklearn.datasets as datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

In [2]:
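# 1. Load the iris dataset
iris = datasets.load_iris()

In [3]:
# 2. Extract the features and labels
feature = iris['data']
target = iris['target']

In [4]:
# 3. Split into training and test sets
x_train,x_test,y_train,y_test = train_test_split(feature,target,test_size=0.2,random_state=2020)

In [5]:
# 4. Inspect the data to see whether feature engineering is needed
x_train.shape

(120, 4)

In [6]:
# 5. Instantiate the model
knn = KNeighborsClassifier(n_neighbors=3)  # n_neighbors is k; different values of k can give different predictions

In [7]:
# 6. Train the model on the training set
# X: training features, which must be a 2-D array
# y: training labels
knn = knn.fit(x_train,y_train)
knn

In [8]:
# 7. Test the model
y_pred = knn.predict(x_test)
y_pred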

array([2, 0, 1, 1, 1, 1, 2, 1, 0, 0, 2, 1, 0, 2, 2, 0, 1, 1, 2, 0, 0, 2,
       2, 0, 2, 1, 1, 1, 0, 0])

In [9]:
y_true = y_test
y_true

array([2, 0, 1, 1, 1, 2, 2, 1, 0, 0, 2, 2, 0, 2, 2, 0, 1, 1, 2, 0, 0, 2,
       1, 0, 2, 1, 1, 1, 0, 0])

In [10]:
knn.score(x_test,y_test)

0.9
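
For a classifier, `score` reports mean accuracy: the fraction of test samples whose predicted label equals the true label. The sketch below recomputes it from the two arrays printed above; the names `mismatches` and `accuracy` are introduced for illustration.

```python
# Predicted and true labels copied from the outputs of In [8] and In [9].
y_pred = [2, 0, 1, 1, 1, 1, 2, 1, 0, 0, 2, 1, 0, 2, 2, 0, 1, 1, 2, 0, 0, 2,
          2, 0, 2, 1, 1, 1, 0, 0]
y_true = [2, 0, 1, 1, 1, 2, 2, 1, 0, 0, 2, 2, 0, 2, 2, 0, 1, 1, 2, 0, 0, 2,
          1, 0, 2, 1, 1, 1, 0, 0]

# Positions where the prediction differs from the true label.
mismatches = [i for i, (p, t) in enumerate(zip(y_pred, y_true)) if p != t]
accuracy = 1 - len(mismatches) / len(y_true)

print(mismatches)  # [5, 11, 22]
print(accuracy)    # 0.9
```

Three of the 30 test samples are misclassified, which matches the reported score of 0.9.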

### Predicting whether annual income exceeds 50K USD

In [11]:
import pandas as pd
import numpy as np
import sklearn.datasets as datasets
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

In [12]:
# 1. Load the data
df = pd.read_csv('./datasets/adults.txt',header=None)
df.columns=['age','workclass','final_weight','education','education_num','marital_status','ocupation','relationship','race','sex','capital_gain','capital_loss','hours_per_week','native_country','salary']
df.head(3)


Unnamed: 0,age,workclass,final_weight,education,education_num,marital_status,ocupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,salary
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K


In [13]:
# 2. Extract the features and labels
target = df['salary']
feature = df[['age','education_num','ocupation','hours_per_week']]
feature

Unnamed: 0,age,education_num,ocupation,hours_per_week
0,39,13,Adm-clerical,40
1,50,13,Exec-managerial,13
2,38,9,Handlers-cleaners,40
3,53,7,Handlers-cleaners,40
4,28,13,Prof-specialty,40
...,...,...,...,...
32556,27,12,Tech-support,38
32557,40,9,Machine-op-inspct,40
32558,58,9,Adm-clerical,40
32559,22,9,Adm-clerical,20


In [14]:
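# One-hot encode the categorical 'ocupation' column before splitting, so train and test share the same columns
occ_one_hot = pd.get_dummies(feature['ocupation'],dtype=int)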
feature = pd.concat((feature,occ_one_hot),axis=1).drop(labels='ocupation',axis=1)
feature

Unnamed: 0,age,education_num,hours_per_week,?,Adm-clerical,Armed-Forces,Craft-repair,Exec-managerial,Farming-fishing,Handlers-cleaners,Machine-op-inspct,Other-service,Priv-house-serv,Prof-specialty,Protective-serv,Sales,Tech-support,Transport-moving
0,39,13,40,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
1,50,13,13,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
2,38,9,40,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
3,53,7,40,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
4,28,13,40,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,12,38,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
32557,40,9,40,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
32558,58,9,40,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
32559,22,9,20,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0


In [15]:
# 3. Split into training and test sets
x_train,x_test,y_train,y_test = train_test_split(feature,target,test_size=0.1,random_state=2020)

In [16]:
# 4. Inspect the data to see whether feature engineering is needed
#occ_one_hot = pd.get_dummies(x_train['ocupation'],dtype=int)
#x_train = pd.concat((x_train,occ_one_hot),axis=1).drop(labels='ocupation',axis=1)

In [17]:
#occ_one_hot_test = pd.get_dummies(x_test['ocupation'],dtype=int)
#x_test = pd.concat((x_test,occ_one_hot_test),axis=1).drop(labels='ocupation',axis=1)
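
One likely reason to encode before splitting, as In [14] does, rather than encoding each split separately as in the commented-out cells: if a category is absent from one split, the two encoded frames end up with different columns. The sketch below shows this on hypothetical toy data and one way to realign the columns with `reindex`; it is an illustration, not part of the original workflow.

```python
import pandas as pd

# Hypothetical toy splits: the test split has no "Sales" rows.
train = pd.DataFrame({"occupation": ["Sales", "Tech-support", "Sales"]})
test = pd.DataFrame({"occupation": ["Tech-support"]})

train_encoded = pd.get_dummies(train["occupation"], dtype=int)
test_encoded = pd.get_dummies(test["occupation"], dtype=int)
print(list(train_encoded.columns))  # ['Sales', 'Tech-support']
print(list(test_encoded.columns))   # ['Tech-support']

# Align the test columns to the training columns; absent categories become 0.
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)
print(test_aligned.values.tolist())  # [[0, 1]]
```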

In [18]:
x_test

Unnamed: 0,age,education_num,hours_per_week,?,Adm-clerical,Armed-Forces,Craft-repair,Exec-managerial,Farming-fishing,Handlers-cleaners,Machine-op-inspct,Other-service,Priv-house-serv,Prof-specialty,Protective-serv,Sales,Tech-support,Transport-moving
2995,39,7,30,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
2417,24,10,40,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
7874,32,10,50,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
5518,39,13,35,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
11025,48,13,60,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9588,53,14,40,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
18971,32,13,48,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
23606,18,9,25,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
12667,20,8,35,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [19]:
x_train

Unnamed: 0,age,education_num,hours_per_week,?,Adm-clerical,Armed-Forces,Craft-repair,Exec-managerial,Farming-fishing,Handlers-cleaners,Machine-op-inspct,Other-service,Priv-house-serv,Prof-specialty,Protective-serv,Sales,Tech-support,Transport-moving
11456,52,9,50,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
3739,49,5,40,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
26197,42,9,40,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
19836,34,9,40,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
11844,31,9,40,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11971,46,9,40,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
14966,20,9,40,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
7491,25,9,25,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
29064,25,13,40,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0


In [20]:
# 5. Instantiate and train the model
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(x_train,y_train)

In [21]:
knn.score(x_test,y_test)
# Workaround for the error below: install an older scikit-learn release
# pip uninstall scikit-learn
# pip install scikit-learn==1.2.2

AttributeError: 'Flags' object has no attribute 'c_contiguous'

### Finding the best k with a learning curve

In [None]:
scores = []
ks = []
for i in range(5,50):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(x_train,y_train)
    score = knn.score(x_test,y_test)
    scores.append(score)
    ks.append(i)

In [None]:
# Convert to arrays; the cells below call max() and argmax() on them
scores_arr = np.array(scores)
ks_arr = np.array(ks)

In [None]:
import matplotlib.pyplot as plt
plt.plot(ks,scores)
plt.xlabel('k_values')
plt.ylabel('score')

In [None]:
# Find the highest score
scores_arr.max()

In [None]:
scores_arr.argmax()

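In [None]:
# Map the argmax position back to its k (index 33, i.e. k=38, in the original run)
best_k = int(ks_arr[scores_arr.argmax()])
best_k

In [None]:
knn = KNeighborsClassifier(n_neighbors=best_k).fit(x_train,y_train)
knn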

In [None]:
knn.score(x_test,y_test)

### KNN for judging dating-site match quality (datingTestSet.txt)

In [None]:
import pandas as pd
import numpy as np
import sklearn.datasets as datasets
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

In [None]:
df = pd.read_csv('./datasets/datingTestSet.txt',header=None,sep=r'\s+')  # raw string avoids an invalid-escape warning
df

In [None]:
feature = df[[0,1,2]]
target = df[3]

In [None]:
x_train,x_test,y_train,y_test = train_test_split(feature,target,test_size=0.2,random_state=2020)

In [None]:
x_train

In [None]:
# Feature engineering: normalize or standardize the features
from sklearn.preprocessing import MinMaxScaler
mm = MinMaxScaler()
m_x_train = mm.fit_transform(x_train)
m_x_train

In [None]:
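# Reuse the scaler fitted on the training set; refitting on the test set would scale the two sets differently
m_x_test = mm.transform(x_test)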
m_x_test
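
Min-max scaling maps each feature with (x - min) / (max - min), where min and max are learned in `fit`. The sketch below, on made-up one-column data, contrasts applying the training-set statistics to the test set with refitting on the test set. The array values are assumptions chosen to make the difference easy to see.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data.
train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[2.5], [20.0]])

# Learn min=0 and max=10 from the training data only, then apply them to the test data.
scaler = MinMaxScaler().fit(train)
test_scaled = scaler.transform(test)  # approximately [[0.25], [2.0]]

# Refitting on the test data uses its own min=2.5 and max=20 instead.
test_refit = MinMaxScaler().fit_transform(test)  # approximately [[0.0], [1.0]]

print(test_scaled.ravel())
print(test_refit.ravel())
```

With `transform`, a test value outside the training range falls outside [0, 1], which is expected; with a refit, the same raw value would map to different scaled values in the two sets, so distances between training and test points would no longer be comparable.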

In [None]:
scores = []
ks = []
for i in range(2,50):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(m_x_train,y_train)
    score = knn.score(m_x_test,y_test)
    scores.append(score)
    ks.append(i)

In [None]:
scores_arr = np.array(scores)
ks_arr = np.array(ks)

In [None]:
import matplotlib.pyplot as plt
plt.plot(ks_arr,scores_arr)
plt.xlabel('k_values')
plt.ylabel('score')

In [None]:
scores_arr.argmax()

In [None]:
scores_arr.max()

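In [None]:
# Map the argmax position back to its k (index 1, i.e. k=3, in the original run)
best_k = int(ks_arr[scores_arr.argmax()])
best_k

In [None]:
knn = KNeighborsClassifier(n_neighbors=best_k).fit(m_x_train,y_train)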

In [None]:
knn.score(m_x_test,y_test)

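### Choosing k: learning curves and cross-validation
- With a small k, the model is more complex and prone to overfitting; the **estimation error** grows and predictions become very sensitive to the nearest instances
- With a large k, the estimation error shrinks but the **approximation error** grows: training instances far from the input also influence the prediction and can make it wrong; model complexity decreases as k increases
- In practice, start from a fairly small value and select the best k by cross-validation
- Suitable for datasets with thousands to tens of thousands of samples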
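
The two extremes can be seen on synthetic data. In this sketch (the blobs, class sizes, and seed are assumptions for illustration), k=1 reproduces every training label because each point is its own nearest neighbor, while k equal to the number of samples lets every point vote and therefore always predicts the majority class.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Two overlapping Gaussian blobs: 120 samples of class 0 and 80 of class 1.
X = np.vstack([
    rng.normal(0.0, 1.0, size=(120, 2)),
    rng.normal(1.5, 1.0, size=(80, 2)),
])
y = np.array([0] * 120 + [1] * 80)

train_accuracy = {}
for k in (1, 15, 200):
    model = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    train_accuracy[k] = model.score(X, y)  # accuracy on the training data itself

print(train_accuracy)
```

Training accuracy is 1.0 for k=1 (the model memorizes the data) and 0.6 for k=200 (every prediction is class 0, which covers 120 of 200 samples); intermediate values of k fall between these extremes of complexity.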


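### K-fold cross-validation
- Purpose: select the most suitable values for the model's hyperparameters
- The training data is split repeatedly into different training and validation sets, and the model's accuracy is measured on each split; the mean of these accuracies is the result of one cross-validation run. Cross-validation is run for each candidate hyperparameter value, and the value with the highest mean accuracy is used to build the model

- Implementation outline
  1. Split the dataset into k equal parts
  2. Use one part as test data and the rest as training data
  3. Compute the test accuracy
  4. Repeat steps 2 and 3 with a different part as the test set each time
  5. Average the accuracies as an estimate of the accuracy on unseen data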
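
The outline above can be written out by hand and compared with scikit-learn's helper. This is a sketch on the iris data; the fold count, seed, and variable names are assumptions for the illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# A fixed random_state makes the splitter yield the same folds each time it is used.
splitter = KFold(n_splits=5, shuffle=True, random_state=0)

# Steps 2-4: hold out each fold in turn, train on the rest, record the accuracy.
fold_accuracies = []
for train_idx, test_idx in splitter.split(X):
    model = KNeighborsClassifier(n_neighbors=5).fit(X[train_idx], y[train_idx])
    fold_accuracies.append(model.score(X[test_idx], y[test_idx]))

# Step 5: average the per-fold accuracies.
manual_mean = float(np.mean(fold_accuracies))

# The library helper with the same splitter should give the same mean.
library_mean = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=splitter).mean()
print(manual_mean, library_mean)
```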

- API
  - from sklearn.model_selection import cross_val_score
  - cross_val_score(estimator,X,y,cv)
    - estimator: the model object
    - X: training features
    - y: training targets
    - cv: number of folds

- Basic use of cross-validation with KNN

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
iris = datasets.load_iris()
feature = iris['data']
target = iris['target']
# Split into training and test sets
x_train,x_test,y_train,y_test = train_test_split(feature,target,test_size=0.2,random_state=2020)
# Cross-validate on the training set with k=5
knn = KNeighborsClassifier(n_neighbors=5)
cross_val_score(knn,x_train,y_train,cv=5)

In [None]:
# Mean accuracy across the folds
cross_val_score(knn,x_train,y_train,cv=5).mean()

- k=5 is not necessarily the best choice for KNN

In [None]:
scores = []
ks = []
iris = datasets.load_iris()
feature = iris['data']
target = iris['target']
# Split into training and test sets
x_train,x_test,y_train,y_test = train_test_split(feature,target,test_size=0.2,random_state=2020)
for k in range(3,20):
    knn = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(knn,x_train,y_train,cv=6).mean()
    scores.append(score)
    ks.append(k)
plt.plot(ks,scores)

- Cross-validation can also help with model selection; here KNN and logistic regression are compared on the iris data

In [None]:
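from sklearn.linear_model import LogisticRegression
knn = KNeighborsClassifier(n_neighbors=5)
print(cross_val_score(knn,x_train,y_train,cv=5).mean())
# Use the same number of folds for both models so the means are comparable
lr = LogisticRegression(max_iter=1000)  # a larger max_iter helps the solver converge on unscaled features
print(cross_val_score(lr,x_train,y_train,cv=5).mean())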

### K-Fold & cross_val_score
- scikit-learn provides a K-Fold API
  - n_splits: the number of folds
  - shuffle: whether to shuffle the data before splitting
  - random_state: the random seed

In [None]:
from numpy import array
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

data = array([0.1,0.2,0.3,0.4,0.5,0.6])
kfold = KFold(n_splits = 3,shuffle=True, random_state=1)

for train,test in kfold.split(data):
    print('train: %s,test: %s' %(data[train],data[test]))

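- scikit-learn provides cross-validation functions that accept a K-Fold splitter
  - sklearn.model_selection.cross_validate
  - These functions have no shuffle option of their own, so they need to be combined with KFold
  - If the training data was already shuffled before splitting, for example by train_test_split, cross_val_score can be used directly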
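
A minimal sketch of `cross_validate` combined with a shuffling `KFold` splitter, on the iris data. The fold count and seed are assumptions for the illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# load_iris returns the samples ordered by class, so shuffling before splitting matters here.
splitter = KFold(n_splits=5, shuffle=True, random_state=42)
results = cross_validate(KNeighborsClassifier(n_neighbors=5), X, y, cv=splitter)

# cross_validate returns a dict of arrays with one entry per fold.
print(sorted(results.keys()))  # ['fit_time', 'score_time', 'test_score']
print(results["test_score"].mean())
```

Unlike `cross_val_score`, which returns only the per-fold scores, `cross_validate` also reports fit and scoring times.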

In [None]:
from sklearn.model_selection import KFold, cross_val_score
iris = datasets.load_iris()
X,y = iris.data,iris.target

knn = KNeighborsClassifier(n_neighbors=5)

n_folds = 5
# Pass the KFold object itself as cv; get_n_splits(X) would return only the integer 5 and drop the shuffling
kf = KFold(n_splits=n_folds,shuffle=True, random_state=42)
scores = cross_val_score(knn,X,y,cv=kf)

scores.mean()