# Hyper-Parameter k (Hyperparameters)

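- Hyperparameter: a parameter that must be set before the algorithm runs.
- Model parameter: a parameter that the algorithm learns during training.

- kNN has no model parameters; k is a hyperparameter of the kNN algorithm.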
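
To make the distinction concrete, here is a minimal hand-written kNN sketch in plain Python. It is not sklearn's implementation; the function name `knn_predict` and the toy data are made up for illustration. Nothing is learned ahead of time: the training points are only stored, and `k` has to be supplied by the caller.

```python
import math
from collections import Counter

def knn_predict(X_train, y_train, x, k):
    """Predict the label of x by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to the query point.
    order = sorted(range(len(X_train)), key=lambda i: math.dist(X_train[i], x))
    # Count the labels of the k closest points; k is chosen by the caller, not learned.
    votes = Counter(y_train[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Toy data (made up for illustration): two well-separated clusters.
X_toy = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
y_toy = ["a", "a", "a", "b", "b", "b"]

pred_near_a = knn_predict(X_toy, y_toy, (0.5, 0.5), k=3)
pred_near_b = knn_predict(X_toy, y_toy, (5.5, 5.5), k=3)
print(pred_near_a, pred_near_b)
```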

In [1]:
import numpy as np
from sklearn import datasets

In [2]:
digits = datasets.load_digits()
X = digits.data
y = digits.target

In [3]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=333)

In [11]:
from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)
knn_clf.score(X_test, y_test)

0.9833333333333333

In [15]:
KNeighborsClassifier?

### 1. Finding the best k

In [9]:
best_score = 0.0
best_k = -1
for k in range(1, 11):
    knn_clf = KNeighborsClassifier(n_neighbors=k)
    knn_clf.fit(X_train, y_train)
    score = knn_clf.score(X_test, y_test)
    if score > best_score:
        best_score = score
        best_k = k

print("best_k = ", best_k)
print("best_score = ", best_score)

best_k =  1
best_score =  0.9916666666666667


[![kNN.png](https://i.postimg.cc/zBtDyWz1/kNN.png)](https://postimg.cc/47cRMYH8)

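#### Now consider this situation
- In the figure above, the red point is the closest to the green point, and red gets 1 vote.
- Blue gets 2 votes, but both blue points are farther away. If we consider only the hyperparameter k, the green point is assigned to blue.
- Yet red is clearly the nearest neighbor, so is relying on k alone really reasonable?
- This motivates a second hyperparameter: whether votes are weighted by distance.
- In sklearn's `KNeighborsClassifier`, this is done by setting the `weights` parameter to `"distance"`.
- Weighting by distance also handles ties: when classes receive the same number of votes, the closer neighbors decide.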
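
The situation described above can be reproduced with a small hand-written vote in plain Python. This is only a sketch: `knn_vote` and the coordinates are hypothetical, chosen so that one red point sits at distance 1 and two blue points sit at distances 3 and 4 from the query. The inverse-distance weighting mirrors what `weights="distance"` does in sklearn, but the code is not sklearn's implementation.

```python
import math
from collections import defaultdict

def knn_vote(X_train, y_train, x, k, weights="uniform"):
    """Return the winning label among the k nearest neighbors of x.

    weights="uniform": every neighbor counts as one vote.
    weights="distance": every neighbor counts as 1 / distance.
    """
    # Keep the k (distance, label) pairs with the smallest distance.
    neighbors = sorted(
        (math.dist(p, x), label) for p, label in zip(X_train, y_train)
    )[:k]
    totals = defaultdict(float)
    for dist, label in neighbors:
        if weights == "distance":
            # Assumes dist > 0, i.e. the query is not itself a training point.
            totals[label] += 1.0 / dist
        else:
            totals[label] += 1.0
    return max(totals, key=totals.get)

# Toy data (made up): red at distance 1, blue at distances 3 and 4.
X_toy = [(1.0, 0.0), (0.0, 3.0), (4.0, 0.0)]
y_toy = ["red", "blue", "blue"]
query = (0.0, 0.0)

uniform_winner = knn_vote(X_toy, y_toy, query, k=3, weights="uniform")
distance_winner = knn_vote(X_toy, y_toy, query, k=3, weights="distance")
print(uniform_winner, distance_winner)
```

With uniform votes blue wins 2 to 1. With inverse-distance weights red scores 1.0 against roughly 0.58 for blue (1/3 + 1/4), so red wins.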

### 2. Weight by distance, or not?

In [14]:
best_method = ""
best_score = 0.0
best_k = -1
for method in ["uniform", "distance"]:
    for k in range(1, 11):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights=method)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_method = method
            best_score = score
            best_k = k

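# Report the stored best_method, not the loop variable `method`,
# which always holds "distance" once the loops finish.
print("best_method = ", best_method)
print("best_k = ", best_k)
print("best_score = ", best_score)

best_method =  uniform
best_k =  1
best_score =  0.9916666666666667

With k = 1 both weighting schemes predict from the same single neighbor, so their scores tie, and the strict `>` comparison keeps the first scheme tried, `uniform`.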


#### Minkowski distance
[![distance.png](https://i.postimg.cc/d0dkz5jS/distance.png)](https://postimg.cc/f3zRdjFx)

- When p is 1, it is the Manhattan distance.
- When p is 2, it is the Euclidean distance.
- **This gives us yet another hyperparameter: p.**
- sklearn's `KNeighborsClassifier` already exposes it as the parameter p : integer, optional (default = 2)
    Power parameter for the Minkowski metric. When p = 1, this is
    equivalent to using manhattan_distance (l1), and euclidean_distance
    (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.
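
As a quick check of the two special cases, the following plain-Python sketch computes the Minkowski distance directly from its definition, the p-th root of the sum of |a_i - b_i|^p. The function name `minkowski_distance` and the sample points are illustrative only.

```python
import math

def minkowski_distance(a, b, p):
    """Minkowski distance: (sum over i of |a_i - b_i| ** p) ** (1 / p)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

a = (0.0, 0.0)
b = (3.0, 4.0)

d_manhattan = minkowski_distance(a, b, 1)  # |3| + |4|
d_euclidean = minkowski_distance(a, b, 2)  # sqrt(3**2 + 4**2)
d_p3 = minkowski_distance(a, b, 3)         # neither Manhattan nor Euclidean

print(d_manhattan, d_euclidean, d_p3)
```

For these points the distance shrinks as p grows, which is why changing p can change which training points count as the nearest neighbors.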

### 3. Searching for the best p in the Minkowski distance

In [18]:
%%time

best_p = -1
best_score = 0.0
best_k = -1
for k in range(1, 11):
    for p in range(1, 6):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights="distance", p=p)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_p = p
            best_score = score
            best_k = k

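# Report the stored best_p, not the loop variable `p`,
# which always holds 5 once the loops finish.
print("best_p = ", best_p)
print("best_k = ", best_k)
print("best_score = ", best_score)

best_p =  5
best_k =  1
best_score =  0.9916666666666667
Wall time: 25.5 s

The `best_p =  5` in this recorded output was printed from the loop variable `p`, not from `best_p`. Because k = 1 with p = 2 (the default metric used in section 1) already scores 0.9917, the stored `best_p` is 1 or 2.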
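
The nested loops in this notebook all follow the same pattern, which can be written once as a small helper. The sketch below is a hypothetical generalization, not part of sklearn: `grid_search`, `toy_score`, and the grid are made up, and `toy_score` is a stand-in so the example runs without any dataset. Returning the best combination from the function avoids reading loop variables after the search ends.

```python
from itertools import product

def grid_search(score_fn, param_grid):
    """Try every combination in param_grid and return (best_params, best_score).

    score_fn is a hypothetical callable that takes keyword arguments and
    returns a number; higher is better.
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            # Store the winning combination itself, so nothing depends on
            # loop variables once the search is over.
            best_params, best_score = params, score
    return best_params, best_score

# Stand-in scoring function (made up): it peaks at k=3, p=2.
def toy_score(k, p):
    return -((k - 3) ** 2) - (p - 2) ** 2

best_params, best_score = grid_search(
    toy_score, {"k": range(1, 11), "p": range(1, 6)}
)
print(best_params, best_score)
```

For the search in this section, `score_fn` could build `KNeighborsClassifier(n_neighbors=k, weights="distance", p=p)`, fit it on `X_train, y_train`, and return `score(X_test, y_test)`.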
