## k-Nearest Neighbors (k-NN)
### Three elements
> 1. Choice of k
> 2. Distance metric
> 3. Classification decision rule
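
To make the three elements concrete, here is a minimal sketch of a k-NN classifier in which each element appears as one line. The function name `knn_predict`, its parameters, and the toy data are assumptions made for illustration only; `math.dist` (Python 3.8+) stands in for the distance metric.

```python
from collections import Counter
import math


def knn_predict(train, query, k=3, dist=math.dist):
    """Classify `query` from a list of (features, label) pairs."""
    # Element 2: the distance metric ranks the training points.
    ranked = sorted(train, key=lambda pair: dist(pair[0], query))
    # Element 1: k decides how many of the closest points get a vote.
    labels = [label for _, label in ranked[:k]]
    # Element 3: the decision rule, here a simple majority vote.
    return Counter(labels).most_common(1)[0][0]


# Hypothetical toy data: two points of class 'a', three of class 'b'.
train = [
    ((1.0, 1.0), 'a'), ((1.2, 0.8), 'a'),
    ((4.0, 4.0), 'b'), ((4.2, 3.9), 'b'), ((3.8, 4.1), 'b'),
]

print(knn_predict(train, (1.1, 1.0), k=1))  # 'a': only the closest point votes
print(knn_predict(train, (1.1, 1.0), k=5))  # 'b': all five points vote, 3 vs 2
```

The same query gets a different label for `k=1` and `k=5`, which is why the choice of k is listed as an element in its own right.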

## Distance
### $L_p$ distance
$$L_p\left( x_i, x_j \right) = \left\lgroup \sum_{l=1}^{n}{\left|x_i^{(l)}-x_j^{(l)}\right|^p} \right\rgroup^\frac{1}{p} $$
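
The formula translates almost directly into code. The sketch below is one possible pure-Python version; the name `lp_distance` and the sample points are assumptions for illustration, and `p = float('inf')` is handled separately as the maximum coordinate difference.

```python
def lp_distance(x, y, p):
    """L_p distance between two equal-length sequences of numbers."""
    if p == float('inf'):
        # Limiting case: the largest coordinate-wise difference.
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


xi, xj = (1.0, 1.0), (4.0, 5.0)  # coordinate differences are 3 and 4
for p in (1, 2, float('inf')):
    print(p, lp_distance(xi, xj, p))
```

For these two points the distance shrinks from 7 (`p=1`) to 5 (`p=2`) to 4 (`p=inf`) as `p` grows.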
### Euclidean distance
When $p=2$, the $L_p$ distance is called the Euclidean distance.
$$ L_2\left( x_i, x_j \right) = \left\lgroup \sum_{l=1}^n \left| x_i^{(l)}-x_j^{(l)} \right|^2 \right\rgroup^{\frac{1}{2}} $$
### Manhattan distance
When $p=1$, the $L_p$ distance is called the Manhattan distance.
$$ L_1\left( x_i, x_j \right) = \sum_{l=1}^n \left| x_i^{(l)}-x_j^{(l)} \right|  $$
### Chebyshev distance
In the limit $p \to \infty$, the $L_p$ distance is called the Chebyshev distance.
$$ L_\infty\left( x_i, x_j \right) = \max_l \left| x_i^{(l)}-x_j^{(l)} \right| $$
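
A short justification of why the limit is a maximum, added here as a sketch. The shorthand $d_l = \left| x_i^{(l)}-x_j^{(l)} \right|$ and $M = \max_l d_l$ is introduced only for this derivation:

$$ M = \left( M^p \right)^{\frac{1}{p}} \le \left( \sum_{l=1}^{n} d_l^p \right)^{\frac{1}{p}} \le \left( n M^p \right)^{\frac{1}{p}} = n^{\frac{1}{p}} M $$

Since $n^{\frac{1}{p}} \to 1$ as $p \to \infty$, both bounds converge to $M$, so $L_p\left( x_i, x_j \right) \to \max_l d_l$. For example, with $x_i = (1, 1)$ and $x_j = (4, 5)$ the differences are $3$ and $4$, giving $L_1 = 7$, $L_2 = 5$ and $L_\infty = 4$.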

## kd-tree


In [52]:
from sklearn import neighbors
import numpy as np

x = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
# kneighbors expects a 2D array of query points: (n_queries, n_features)
y = [[1, 1]]
nbr = neighbors.NearestNeighbors(n_neighbors=2, algorithm='auto').fit(x)
distance, indices = nbr.kneighbors(y)
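
The cell above passes `algorithm='auto'`, so scikit-learn chooses the search structure itself. To show what a kd-tree does, here is a minimal pure-Python sketch of building one and running a nearest-neighbor query on the same six points. It is an illustration under simplifying assumptions (dict-based nodes, a single nearest neighbor), not scikit-learn's implementation; `build_kdtree` and `nearest` are hypothetical names.

```python
import math


def build_kdtree(points, depth=0):
    """Split the points at the median, cycling through the axes by depth."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        'point': points[mid],
        'axis': axis,
        'left': build_kdtree(points[:mid], depth + 1),
        'right': build_kdtree(points[mid + 1:], depth + 1),
    }


def nearest(node, query, best=None):
    """Return (distance, point) for the stored point closest to `query`."""
    if node is None:
        return best
    d = math.dist(node['point'], query)
    if best is None or d < best[0]:
        best = (d, node['point'])
    # Signed distance from the query to this node's splitting plane.
    diff = query[node['axis']] - node['point'][node['axis']]
    near, far = ((node['left'], node['right']) if diff < 0
                 else (node['right'], node['left']))
    best = nearest(near, query, best)
    # Visit the far side only if the plane is closer than the best hit so far.
    if abs(diff) < best[0]:
        best = nearest(far, query, best)
    return best


points = [(-1, -1), (-2, -1), (-3, -2), (1, 1), (2, 1), (3, 2)]
tree = build_kdtree(points)
print(nearest(tree, (1, 1)))      # the query is itself a stored point
print(nearest(tree, (2.4, 1.6)))
```

The pruning test in `nearest` is what saves work compared with a linear scan: a whole subtree is skipped whenever the splitting plane is farther away than the best match found so far.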



### Reference
> https://www.analyticsvidhya.com/blog/2014/10/introduction-k-neighbours-algorithm-clustering/

In [35]:
## Implementation (Iris ⚜️)
import csv
import math
import operator
import random

def main():
    # Prepare the data
    filename = './data/iris.data'
    trainingSet = []
    testSet = []
    split = 0.67
    loadDataset(filename, split, trainingSet, testSet)
    print('Train: ' + repr(len(trainingSet)))
    print('Test: ' + repr(len(testSet)))

    # Generate predictions
    predictions = []
    k = 3
    for x in range(len(testSet)):
        neighbors = getNeighbors(trainingSet, testSet[x], k)
        result = getResponse(neighbors)
        predictions.append(result)
        print('> predicted= ' + repr(result) + ', actual= ' + repr(testSet[x][-1]))

    accuracy = getAccuracy(testSet, predictions)
    print('Accuracy: ' + repr(accuracy) + '%')

# loadDataset(filename, split, trainingSet, testSet)

# print('Train: ' + repr(len(trainingSet)))
# print('Test: ' + repr(len(testSet)))
# print(trainingSet)

# Load the data and split it randomly into training and test sets
def loadDataset(filename, split, trainingSet, testSet):
    # Text mode with newline='' is what the csv module expects in Python 3
    with open(filename, 'r', newline='') as csvfile:
        lines = csv.reader(csvfile)
        dataset = list(lines)
        for x in range(len(dataset)-1):  # the last row is skipped
            for y in range(4):
                dataset[x][y] = float(dataset[x][y])  # string -> float
            if random.random() < split:  # training set with probability `split`
                trainingSet.append(dataset[x])
            else:
                testSet.append(dataset[x])

# data1 = [2, 2, 2, 'a']
# data2 = [4, 4, 4, 'b']
# distance = euclideanDistance(data1, data2, 3)
# print('Distance: ' + repr(distance))

# Similarity measure: Euclidean (L2) distance over the first `length` features
def euclideanDistance(instance1, instance2, length):
    distance = 0
    for x in range(length):
        distance += pow((instance1[x] - instance2[x]), 2)
    return math.sqrt(distance)

# trainSet = [ [2, 2, 2, 'a'], [4, 4, 4, 'b'] ]
# testInstance = [5, 5, 5]
# k = 1
# neighbors = getNeighbors(trainSet, testInstance, 1)
# print(neighbors)

# Neighbors: compute the distance to every training instance, keep the k nearest
def getNeighbors(trainingSet, testInstance, k):
    distances = []
    length = len(testInstance)-1  # the last element is the class label
    for x in range(len(trainingSet)):
        dist = euclideanDistance(testInstance, trainingSet[x], length)
        distances.append((trainingSet[x], dist))
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(k):
        neighbors.append(distances[x][0])
    return neighbors

# neighbors = [ [1, 1, 1, 'a'], [2, 2, 2, 'a'], [3, 3, 3, 'b'], [4, 4, 4, 'b'] ]
# response = getResponse(neighbors)
# print(response)

# Decision rule: majority vote among the neighbors
def getResponse(neighbors):
    classVotes = {}
    for x in range(len(neighbors)):
        response = neighbors[x][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    # dict.items() replaces the Python 2 only iteritems()
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]

# testSet = [ [1, 1, 1, 'a'], [2, 2, 2, 'a'], [3, 3, 3, 'b'] ]
# predictions = ['a', 'a', 'a']
# accuracy = getAccuracy(testSet, predictions)
# print(accuracy)

# Accuracy: percentage of test instances whose predicted class is correct
def getAccuracy(testSet, predictions):
    correct = 0
    for x in range(len(testSet)):
#         print(id(testSet[x][-1]), id(predictions[x]))
#         if testSet[x][-1] is predictions[x]:
        if testSet[x][-1] == predictions[x]:
            correct += 1
    return (correct / float(len(testSet))) * 100.0

# Run only after every function above has been defined
main()





# Reference
# https://python.freelycode.com/contribution/detail/304

Train: 89
Test: 60
> predicted= 'Iris-setosa', actual= 'Iris-setosa'
> predicted= 'Iris-setosa', actual= 'Iris-setosa'
> predicted= 'Iris-setosa', actual= 'Iris-setosa'
> predicted= 'Iris-setosa', actual= 'Iris-setosa'
> predicted= 'Iris-setosa', actual= 'Iris-setosa'
> predicted= 'Iris-setosa', actual= 'Iris-setosa'
> predicted= 'Iris-setosa', actual= 'Iris-setosa'
> predicted= 'Iris-setosa', actual= 'Iris-setosa'
> predicted= 'Iris-setosa', actual= 'Iris-setosa'
> predicted= 'Iris-setosa', actual= 'Iris-setosa'
> predicted= 'Iris-setosa', actual= 'Iris-setosa'
> predicted= 'Iris-setosa', actual= 'Iris-setosa'
> predicted= 'Iris-setosa', actual= 'Iris-setosa'
> predicted= 'Iris-setosa', actual= 'Iris-setosa'
> predicted= 'Iris-setosa', actual= 'Iris-setosa'
> predicted= 'Iris-setosa', actual= 'Iris-setosa'
> predicted= 'Iris-setosa', actual= 'Iris-setosa'
> predicted= 'Iris-setosa', actual= 'Iris-setosa'
> predicted= 'Iris-setosa', actual= 'Iris-setosa'
> predicted= 'Iris-setosa', act

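Training set / test set ≈ 67/33. Each row is assigned to the training set with probability 0.67, so the actual counts vary between runs (89 / 60 in the run above).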
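
If an exact and reproducible split is wanted instead of a per-row coin flip, one option is to shuffle once with a fixed seed and cut the list. This is a sketch of an alternative, not what the cell above does; `split_dataset`, `train_fraction` and `seed` are hypothetical names.

```python
import random


def split_dataset(rows, train_fraction=0.67, seed=0):
    """Shuffle a copy of `rows` and cut it at a fixed fraction."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # private RNG, so the split is repeatable
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


train, test = split_dataset(range(150))
print(len(train), len(test))  # 100 50
```

With a fixed seed the same rows land in the same set on every run, which makes accuracy figures comparable across experiments.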

In [58]:
import tensorflow