# K-Nearest-Neighbor

## KNN简介

&emsp; 给定一个训练数据集，对新的输入实例，在训练数据集中找到与该实例最近邻的k个实例，这k个实例的多数属于某个类，就把该输入实例分为这个类。

## 距离度量

&emsp; 常采用欧式距离

<center>$L_2(x_i, x_j)=(\sum_{l=1}^{n}{|x_i^{(l)}-x_j^{(l)}|^2})^\frac{1}{2}$</center>

## k值的选择

&emsp; 如果选择较小的k值，就相当于用较小的领域中的训练实例进行预测，近似误差(approximation)会减小，只有与输入实例较近的训练实例才会对预测起作用，缺点是估计误差(estimation error)会增大，预测结果对近邻的实例点非常敏感。

&emsp; 如果选择较大的k值，就相当于用较大邻域中的训练实例进行预测，优点是可以减少学习的估计误差，缺点是学习的近似误差会增大。

## KNN的实现：kd树

&emsp; kd树是一种对k维空间中的实例点进行存储以便对其进行快速检索的二叉树数据结构，kd树表示对k维空间的一个划分，构造kd树相当于不断地用垂直于坐标轴的超平面将k维空间划分，构成一系列k维超矩形区域，kd树每个节点对应于一个k维超矩形区域。

&emsp; 构造平衡kd树：令$x_i=(x_i^{(1)}, x_i^{(2)}, \cdots , x_i^{(k)})^T$：

&emsp; 首先选择$x^{(1)}$为坐标轴，将所有实例的$x^{(1)}$的中位数为切分点，将根节点对应的超矩形区域切分成两个子区域。切分由通过切分点并与坐标轴$x^{(1)}$垂直的超平面实现。由根节点生成深度为1的左、右子节点：左子节点对应坐标$x^{(1)}$小于切分点的子区域，右子节点对应坐标$x^{(1)}$大于切分点的子区域。

&emsp; 重复以下步骤：对深度为j的节点选择$x^{(l)}$为切分的坐标轴，$l=j(mod k)+1$，重复上述操作直到两个子区域没有实例存在时停止。

&emsp; 搜索kd树：

&emsp; 首先在kd树中找出包含目标点x的叶节点：从根节点出发，递归地向下访问kd树，直到子节点为叶节点为止，以此叶节点为“当前最近点”。

&emsp; 递归地向上回退，在每个结点处，若该节点保存的实例点比当前最近点距离目标点更近，则该实例点变为当前最近点。同时检查另一子节点对应区域是否与以目标点为球心、以目标点与当前最近点间距离为半径的超球体相交，若相交，则在另一子节点对应区域内可能存在更近的点，进行搜索；若不相交，向上回退，回到根节点时，搜索结束。

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

In [2]:
X, y = load_digits(return_X_y=True)
print(X.shape, y.shape)

(1797, 64) (1797,)


In [3]:
class KNearestNeighbor(object):
  """ a kNN classifier with L2 distance """

  def __init__(self):
    pass

  def train(self, X, y):
    self.X_train = X
    self.y_train = y
    
  def predict(self, X, k=1):
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train)) 
    M = np.dot(X, self.X_train.T)
    A1 = np.sum(np.square(X), axis=-1)
    A2 = np.sum(np.square(self.X_train), axis=-1)
    dists = np.sqrt(- 2 * M + A1.reshape(-1, 1) + A2.T)
    return self.predict_labels(dists, k=k)

  def predict_labels(self, dists, k=1):
    num_test = dists.shape[0]
    y_pred = np.zeros(num_test)
    for i in range(num_test):
      ind = np.argsort(dists[i], axis=-1)[:k]
      labels = self.y_train[ind].flatten()
      y_pred[i] = np.argmax(np.bincount(labels))
    return y_pred

In [4]:
from sklearn.neighbors import KNeighborsClassifier

In [5]:
X_train, y_train = X[: 1697], y[: 1697]
X_test, y_test = X[1697:], y[1697:]

neigh = KNeighborsClassifier(n_neighbors=10)
model = KNearestNeighbor()
model.train(X_train, y_train)
neigh.fit(X_train, y_train)

print(np.sum(model.predict(X_test, 5) == y_test) / y_test.shape[0], neigh.score(X_test, y_test))

0.99 0.97


In [6]:
class KDNode():
    def __init__(self, index, x, left, right):
        self.index = index
        self.x = x
        self.left = left
        self.right = right

class KDTree():
    def __init__(self):
        self.root = None
    
    def fit(self, X):
        self.root = self._create_node(X, 0)
        return
    
    def _create_node(self, X, index):
        N, D = X.shape
        if not N:
            return None
        X = X[X[:, index].argsort()]
        pos = N // 2
        ind = (index + 1) % D
        return KDNode(index, X[pos], self._create_node(X[:pos], ind), self._create_node(X[pos+1:], ind))
        
    def find_nearest(self, x):
        if not self.root:
            print("no training")
            return
        return self._search(x, self.root)
        
    def _search(self, x, node):
        if not node:
            return None, np.inf
        index = node.index
        point = node.x
        value = point[index]
        
        if x[node.index] <= value:
            nearest, near_distance = self._search(x, node.left)
            other = node.right
        else:
            nearest, near_distance = self._search(x, node.left)
            other = node.left
        
        if near_distance < np.abs(x[node.index] - value):
            return nearest, near_distance
        else:
            distance = np.linalg.norm(x - point)
            if distance < near_distance:
                nearest = point
                near_distance = distance
            other_point, other_distance = self._search(x, other)
            if other_distance < near_distance:
                return other_point, other_distance
            return nearest, near_distance

In [7]:
from time import clock

N = 400000
t0 = clock()
kd = KDTree()
kd.fit(np.random.randn(N, 3))
result = kd.find_nearest([0.1,0.5,0.8])
t1 = clock()
print ("time: ",t1-t0, "s")
print (result)

time:  24.21884052249912 s
(array([-1.12635325, -0.00344532,  0.13906767]), 1.4812936983871998)
