## 1. Overview of the KNN Algorithm

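K-nearest neighbors (KNN) is a supervised learning algorithm. Its core idea is to match the sample to be predicted against the existing labeled samples by feature distance: select the K existing samples with the smallest feature distance, count the classes among those K samples, and assign the new sample to the class that occurs most often. Regression works the same way: find the K nearest existing samples by feature distance, then take the mean of their target values as the prediction.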
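The two variants described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the notebook's implementation: the helper names (`k_nearest`, `knn_classify`, `knn_regress`) and the toy data are made up for the example, and Euclidean distance is assumed as the feature distance.

```python
import math
from collections import Counter

def k_nearest(train_X, query, k):
    """Return indices of the k training points closest to `query` (Euclidean)."""
    dists = [math.dist(x, query) for x in train_X]
    return sorted(range(len(train_X)), key=lambda i: dists[i])[:k]

def knn_classify(train_X, train_y, query, k=3):
    """Classification: majority vote among the k nearest labels."""
    idx = k_nearest(train_X, query, k)
    return Counter(train_y[i] for i in idx).most_common(1)[0][0]

def knn_regress(train_X, train_y, query, k=3):
    """Regression: mean of the k nearest target values."""
    idx = k_nearest(train_X, query, k)
    return sum(train_y[i] for i in idx) / k

# Toy data (hypothetical): two features per sample.
X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9)]
labels = ["cheap", "cheap", "cheap", "expensive", "expensive"]
prices = [80.0, 90.0, 100.0, 300.0, 320.0]

predicted_label = knn_classify(X, labels, (1.1, 1.0), k=3)
predicted_price = knn_regress(X, prices, (1.1, 1.0), k=3)
```

The only difference between the two functions is the final aggregation step: a vote for classification, a mean for regression.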

### KNN algorithm diagram
<img src='./KNN.jpg' style='zoom:60%;float:left'>


### Distance calculation in KNN
<img src="Distance.png" style="width:400px;height:80px;float:left">  

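If each sample's features are treated as a vector, the distance between samples can be computed from how closely the corresponding vectors are related.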
<img src='./vector_correlation.jpg' style='zoom:80%;float:left'>
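As a concrete case, the sketch below computes the Euclidean distance between two feature vectors. It assumes Euclidean distance is the metric intended here (it is the default metric of `scipy.spatial.distance.cdist`, which the multivariate section uses); the function name and the two listings are hypothetical.

```python
import math

def euclidean_distance(a, b):
    """Square root of the summed squared differences between two feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical listings described by (accommodates, bedrooms, bathrooms).
listing_a = (3, 1, 1.0)
listing_b = (6, 5, 1.0)
d = euclidean_distance(listing_a, listing_b)

# With a single feature the formula reduces to the absolute difference.
d_single = euclidean_distance((3,), (5,))
```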

## 2. KNN example: predicting listing price from room-count features
<img src='./room_price_predict.png' style='float:left'>

## 3. Univariate KNN

In [3]:
# Univariate distance calculation
import pandas as pd
import numpy as np
features = ['accommodates','bedrooms','bathrooms','beds','price','minimum_nights','maximum_nights','number_of_reviews']
dc_listings = pd.read_csv('listings.csv')      # load the raw data
dc_listings = dc_listings[features].copy()     # keep only the selected columns
our_acc_value = 3   # `accommodates` value of the listing whose price we want to predict
dc_listings['distance'] = np.abs(dc_listings.accommodates - our_acc_value)  # absolute difference as the distance
dc_listings.distance.value_counts().sort_index()  # number of listings at each distance, ordered by distance
dc_listings = dc_listings.sample(frac=1,random_state=0)  # shuffle so that ties in distance are not kept in file order
dc_listings = dc_listings.sort_values('distance')        # sort by distance, nearest first
dc_listings['price'] = dc_listings.price.str.replace(r"\$|,", '', regex=True).astype(float)  # strip "$" and "," and convert price to float
mean_price = dc_listings.price.iloc[:5].mean()  # prediction: mean price of the 5 nearest listings
mean_price

88.0
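Since `listings.csv` is not included here, the same steps can be tried on a small stand-in table. The sketch below is a toy reproduction under stated assumptions: the six rows are invented, the price strings are assumed to look like `"$1,130.00"`, and `k=4` is chosen only so that the set of nearest rows is unambiguous. Passing `regex=True` explicitly matters because `Series.str.replace` treats the pattern as a literal string by default in pandas 2.0 and later.

```python
import pandas as pd

# Invented stand-in for the listings table.
toy = pd.DataFrame({
    "accommodates": [3, 3, 2, 4, 6, 8],
    "price": ["$80.00", "$100.00", "$90.00", "$1,130.00", "$300.00", "$500.00"],
})

# Strip "$" and "," with an explicit regex, then convert to float.
toy["price"] = toy["price"].str.replace(r"[$,]", "", regex=True).astype(float)

# Univariate distance to a listing that accommodates 3 guests.
toy["distance"] = (toy["accommodates"] - 3).abs()

# Shuffle, sort by distance, and average the k nearest prices.
k = 4
nearest = toy.sample(frac=1, random_state=0).sort_values("distance").head(k)
predicted = nearest["price"].mean()
```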

## 4. Multivariate KNN
Multivariate KNN computes the distance over several features with the `distance` module from `scipy.spatial`.

In [9]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
features_labels = ['accommodates','bedrooms','bathrooms','beds','price','minimum_nights','maximum_nights','number_of_reviews']
features=['accommodates','bedrooms','bathrooms','beds','minimum_nights','maximum_nights','number_of_reviews']

dc_listings = pd.read_csv('listings.csv')

dc_listings = dc_listings[features_labels].copy()

dc_listings['price'] = dc_listings.price.str.replace(r"\$|,", '', regex=True).astype(float)  # convert price strings to float

dc_listings = dc_listings.dropna()    # drop rows with missing values

dc_listings[features] = StandardScaler().fit_transform(dc_listings[features])  # standardize the features (price is left unscaled)

normalized_listings = dc_listings   # alias for the standardized table

# Slice first, then copy, so later column assignments do not touch the original table
norm_train_df = normalized_listings.iloc[0:2792].copy()
norm_test_df = normalized_listings.iloc[2792:].copy()
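To make the standardization step less opaque, here is a NumPy sketch of the z-score transform that `StandardScaler` applies by default: subtract each column's mean and divide by its population standard deviation (`ddof=0`). The `standardize` helper and the sample matrix are assumptions made for illustration.

```python
import numpy as np

def standardize(X):
    """Column-wise z-score: (x - mean) / std, using the population std (ddof=0)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled instead of dividing by zero
    return (X - mean) / std, mean, std

# Invented feature matrix: two columns on very different scales.
raw = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0], [6.0, 700.0]])
scaled, mean, std = standardize(raw)
```

Without this step the second column would dominate every Euclidean distance simply because its values are larger. One design choice worth considering: the cell above fits the scaler on all rows before splitting, whereas computing `mean` and `std` from the training rows only and reusing them for the test rows keeps test information out of the preprocessing.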

In [10]:
from scipy.spatial import distance

def predict_price_multivariate(new_listing_value,feature_columns):
    temp_df = norm_train_df.copy()                # copy of the existing (training) samples, so the original is not modified
    # Euclidean distance from every training row to the new listing over the chosen features
    temp_df['distance'] = distance.cdist(temp_df[feature_columns],[new_listing_value[feature_columns]]).ravel()
    temp_df = temp_df.sort_values('distance')
    knn_5 = temp_df.price.iloc[:5]                # prices of the 5 nearest neighbors
    predicted_price = knn_5.mean()
    return predicted_price

cols = ['accommodates', 'bathrooms']  # features used for the distance
norm_test_df['predicted_price'] = norm_test_df[cols].apply(predict_price_multivariate,feature_columns=cols,axis=1)  # predict a price for every test row
norm_test_df['squared_error'] = (norm_test_df['predicted_price'] - norm_test_df['price'])**(2)
mse = norm_test_df['squared_error'].mean()
rmse = mse ** (1/2)   # root mean squared error (RMSE) of the predicted prices
print(rmse)

108.56809421175991
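The row-by-row `apply` in the cell above recomputes and sorts all training distances once per test row. The same predictions can be sketched in vectorized form with NumPy broadcasting. This is an alternative formulation under assumptions, not the notebook's code: `knn_predict_batch` and the toy arrays are hypothetical, and the intermediate array of shape `(n_test, n_train, n_features)` must fit in memory.

```python
import numpy as np

def knn_predict_batch(train_X, train_y, test_X, k=5):
    """Predict each test row as the mean target of its k nearest training rows."""
    train_X = np.asarray(train_X, dtype=float)
    train_y = np.asarray(train_y, dtype=float)
    test_X = np.asarray(test_X, dtype=float)
    # Pairwise differences via broadcasting: shape (n_test, n_train, n_features).
    diffs = test_X[:, None, :] - train_X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # Indices of the k smallest distances per test row; their order does not matter for a mean.
    nearest = np.argpartition(dists, k - 1, axis=1)[:, :k]
    return train_y[nearest].mean(axis=1)

# Toy data: two well-separated clusters of three points each.
train_X = [[0, 0], [1, 0], [0, 1], [10, 10], [11, 10], [10, 11]]
train_y = [10, 20, 30, 100, 200, 300]
test_X = [[0.1, 0.1], [10.5, 10.0]]
preds = knn_predict_batch(train_X, train_y, test_X, k=3)
```

`np.argpartition` avoids a full sort, which is enough here because only the set of the k nearest rows is needed.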


In [11]:
norm_test_df.head(5)

| | accommodates | bedrooms | bathrooms | beds | price | minimum_nights | maximum_nights | number_of_reviews | predicted_price | squared_error |
|---|---|---|---|---|---|---|---|---|---|---|
| 2839 | -1.095648 | -0.249501 | -0.439211 | -0.546933 | 25.0 | -0.341421 | -0.016575 | -0.482571 | 76.0 | 2601.0 |
| 2840 | -0.596625 | -0.249501 | -0.439211 | -0.546933 | 60.0 | -0.341421 | -0.016606 | -0.106278 | 143.8 | 7022.44 |
| 2841 | 0.40142 | -0.249501 | -0.439211 | -0.546933 | 149.0 | -0.341421 | -0.016575 | -0.345737 | 106.8 | 1780.84 |
| 2842 | 0.900443 | -0.249501 | 1.26517 | -0.546933 | 136.0 | 1.316824 | -0.016606 | -0.482571 | 297.8 | 26179.24 |
| 2843 | -0.596625 | -1.439006 | -0.439211 | -0.546933 | 90.0 | -0.341421 | -0.016575 | -0.208903 | 143.8 | 2894.44 |
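The `squared_error` column and the RMSE can be checked by hand on the five rows shown above. The sketch below uses only the `price` and `predicted_price` values from that output; note that the result describes these five rows alone, not the full test set that produced 108.57.

```python
import math

# price and predicted_price for rows 2839-2843 of the output above.
actual = [25.0, 60.0, 149.0, 136.0, 90.0]
predicted = [76.0, 143.8, 106.8, 297.8, 143.8]

squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
mse = sum(squared_errors) / len(squared_errors)   # mean squared error
rmse = math.sqrt(mse)                             # back in the same units as price
```

Taking the square root puts the error back in price units, which makes it easier to interpret than the MSE.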


## 5. Multivariate KNN with scikit-learn

In [12]:
from sklearn.neighbors import KNeighborsRegressor
cols = ['accommodates','bedrooms']   # features used for the distance
knn = KNeighborsRegressor()          # default n_neighbors=5
knn.fit(norm_train_df[cols], norm_train_df['price'])
two_features_predictions = knn.predict(norm_test_df[cols])

from sklearn.metrics import mean_squared_error
two_features_mse = mean_squared_error(norm_test_df['price'], two_features_predictions)
two_features_rmse = two_features_mse ** (1/2)   # RMSE on the test rows
print(two_features_rmse)

115.8952229715371


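## 6. Summary
```
Pros and cons of the KNN algorithm:
Pros: the computation is simple
Cons:
Limited accuracy: in real data each feature contributes differently to the target, but KNN treats every feature as equally important when computing distances
High computational cost: predicting a single sample requires computing the distance to every existing sample
```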
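The equal-weighting limitation can be seen with a tiny example. The sketch below adds per-feature weights to the Euclidean distance; plain KNN corresponds to all weights being 1. The weights are hand-picked and hypothetical: KNN itself has no mechanism for learning them.

```python
import math

def weighted_distance(a, b, weights):
    """Euclidean distance with a multiplicative weight on each squared difference."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

query = (0.0, 0.0)
candidates = {
    "a": (1.0, 0.0),   # differs from the query on feature 1 only
    "b": (0.0, 1.5),   # differs from the query on feature 2 only
}

def nearest(weights):
    """Name of the candidate closest to the query under the given weights."""
    return min(candidates, key=lambda name: weighted_distance(query, candidates[name], weights))

nearest_equal = nearest((1.0, 1.0))      # plain KNN: both features count equally
nearest_weighted = nearest((4.0, 1.0))   # hypothetical: feature 1 matters more
```

Changing the weights changes which neighbor is nearest, so a distance that ignores feature importance can select neighbors that are similar on the wrong features.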

### PS: KNN's inability to weight features differently can be addressed with linear regression
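As a rough illustration of this point, an ordinary least-squares fit learns one coefficient per feature, so features with different influence on the target end up with different weights. The sketch uses `np.linalg.lstsq` on synthetic, noise-free data; the "true" weights and intercept are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))            # two synthetic, already standardized features
true_weights = np.array([50.0, 5.0])     # invented: feature 1 matters ten times more
y = X @ true_weights + 100.0             # noise-free target with intercept 100

# Append a column of ones so the intercept is fitted together with the weights.
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

learned_weights, intercept = coef[:2], coef[2]
```

On real listing data the fit would not be exact, but the coefficients would still differ per feature.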