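# KNN Example: Airbnb Price Prediction
The KNN (K-nearest neighbors) algorithm is commonly used for classification and regression. Here it is used for regression: predicting the nightly price of a listing from the prices of similar listings.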
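To make the idea concrete, here is a minimal sketch of KNN regression written by hand with NumPy. The function name `knn_regress` and the toy numbers are illustrative assumptions, not part of the original notebook: the prediction for a query point is simply the mean target of its `k` closest training rows, measured by Euclidean distance.

```python
import numpy as np

def knn_regress(train_X, train_y, query, k):
    """Predict a value for `query` as the mean target of its k nearest training rows."""
    # Euclidean distance from the query to every training row
    distances = np.sqrt(((train_X - query) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    return train_y[nearest].mean()

# Toy data (made up): one feature per row, target is a price
train_X = np.array([[1.0], [2.0], [3.0], [6.0]])
train_y = np.array([50.0, 60.0, 80.0, 200.0])

# The three rows closest to 2.0 are 1.0, 2.0 and 3.0; the row at 6.0 is ignored
prediction = knn_regress(train_X, train_y, np.array([2.0]), k=3)
print(prediction)
```

`KNeighborsRegressor` from scikit-learn, used in the rest of this example, applies the same idea with uniform weights by default.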

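#### Data features:

* accommodates: number of guests the listing can accommodate
* bedrooms: number of bedrooms
* bathrooms: number of bathrooms
* beds: number of beds
* price: price per night
* minimum_nights: minimum number of nights a guest must book
* maximum_nights: maximum number of nights a guest may book
* number_of_reviews: number of reviews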

## 1. Prepare the data

In [16]:
import pandas as pd

# scikit-learn's preprocessing module
from sklearn.preprocessing import StandardScaler

# Load the data
dc_listings = pd.read_csv(r'D:/001  学习文件/001  学习笔记/020  唐宇迪机器学习实战_20200326/机器学习文件/006 K近邻实例/KNN/listings.csv')

# Select the columns of interest
features = ['accommodates','bedrooms','bathrooms','beds','price','minimum_nights','maximum_nights','number_of_reviews'] 
dc_listings = dc_listings[features]

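# Clean the data
## Drop rows with missing values
dc_listings = dc_listings.dropna()
## Strip "$" and "," from the price strings, then convert to float
## (regex=True is required on recent pandas versions, where str.replace is literal by default)
dc_listings['price'] = dc_listings['price'].str.replace(r"[$,]", "", regex=True).astype(float)
## Standardize every column, including price, to zero mean and unit variance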
dc_listings[features] = StandardScaler().fit_transform(dc_listings[features])

# Inspect the first rows
dc_listings.head()

| | accommodates | bedrooms | bathrooms | beds | price | minimum_nights | maximum_nights | number_of_reviews |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.40142 | -0.249501 | -0.439211 | 0.297386 | 0.081119 | -0.341421 | -0.016575 | -0.516779 |
| 1 | 1.399466 | 2.129508 | 2.969551 | 1.141704 | 1.462622 | -0.065047 | -0.016606 | 1.706767 |
| 2 | -1.095648 | -0.249501 | 1.26517 | -0.546933 | -0.718699 | -0.065047 | -0.016575 | -0.482571 |
| 3 | -0.596625 | -0.249501 | -0.439211 | -0.546933 | -0.391501 | -0.341421 | -0.016575 | -0.516779 |
| 4 | 0.40142 | -0.249501 | -0.439211 | -0.546933 | -0.718699 | 1.316824 | -0.016575 | -0.516779 |
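The cleaning and standardization steps can be checked on a tiny example. The sketch below uses a made-up frame (`toy`, with invented prices) to show that the regex strips the `$` and `,` characters, and that `StandardScaler` computes `(x - mean) / std` per column using the population standard deviation (`ddof=0`).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up rows that mimic the raw price format
toy = pd.DataFrame({
    "price": ["$1,200.00", "$80.00", "$100.00"],
    "beds": [3.0, 1.0, 2.0],
})

# Remove "$" and "," and convert the strings to floats
toy["price"] = toy["price"].str.replace(r"[$,]", "", regex=True).astype(float)

# Standardize with scikit-learn
scaled = StandardScaler().fit_transform(toy[["price", "beds"]])

# The same computation by hand: subtract the column mean, divide by the population std
manual = (toy[["price", "beds"]] - toy[["price", "beds"]].mean()) / toy[["price", "beds"]].std(ddof=0)

print(np.allclose(scaled, manual.to_numpy()))
```

Standardizing matters for KNN because distances are otherwise dominated by whichever column has the largest numeric range.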


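## 2. Split the data
Shuffle the rows, then split them into a training set and a test set. KNN has no real training phase (fitting only stores the training rows), but a held-out test set is still needed to measure the prediction error.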

In [17]:
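# Shuffle the rows (fixed seed for reproducibility)
dc_listings = dc_listings.sample(frac=1,random_state=0)

# Split: the first 2972 rows for training, the rest for testing
train_df = dc_listings.iloc[:2972].copy()
test_df = dc_listings.iloc[2972:].copy()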
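As an alternative to shuffling and slicing by hand, scikit-learn's `train_test_split` does both in one call. This is a sketch on a made-up frame (`toy`); the 80/20 ratio is an assumption for illustration and is not taken from the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 10 rows with one feature and a price
toy = pd.DataFrame({"beds": np.arange(10), "price": np.arange(10) * 10.0})

# Shuffle and split in one step; random_state fixes the shuffle
train_toy, test_toy = train_test_split(toy, test_size=0.2, random_state=0)

print(len(train_toy), len(test_toy))
```

Each row lands in exactly one of the two sets, so the test rows are never seen when the model is fitted.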

## 3. Train the model

In [18]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
cols = ['accommodates','bedrooms','bathrooms','beds','minimum_nights','maximum_nights','number_of_reviews']

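# The best K is not known in advance, so try several values and compare the test error
for i in range(3,21):
    # Instantiate the regressor with K = i neighbors
    knn = KNeighborsRegressor(n_neighbors=i)

    knn.fit(train_df[cols], train_df['price'])
    # Note: despite the "four_features" names, all seven columns in cols are used
    four_features_predictions = knn.predict(test_df[cols])
    four_features_mse = mean_squared_error(test_df['price'], four_features_predictions)
    # RMSE is the square root of the MSE
    four_features_rmse = four_features_mse ** (1/2)

    print(i,four_features_rmse)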

3 0.7584644188322903
4 0.7250856543685408
5 0.7194606079121513
6 0.713523205399514
7 0.6874858944276564
8 0.6832742580337134
9 0.6850467595100533
10 0.6887963071179641
11 0.688040448342358
12 0.6889001022301897
13 0.6862124933883886
14 0.6836203918618494
15 0.6849737588586512
16 0.6808496255480946
17 0.6839540799872993
18 0.6855824464950708
19 0.6906904450945108
20 0.6925732997782492


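The error is smallest at K = 16 (RMSE ≈ 0.681). Because `price` was standardized, this RMSE is expressed in standard deviations of price rather than in dollars.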
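Rather than reading the minimum off the printed list, the errors can be collected in a dictionary and the best K selected programmatically. The sketch below runs on synthetic data from `make_regression` (an assumption, since the listings file is not available here), so the K it picks will not match the result above; names such as `rmse_by_k` and `best_k` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the listings data
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Record the test RMSE for every candidate K
rmse_by_k = {}
for k in range(3, 21):
    model = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    rmse_by_k[k] = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))

# Pick the K with the smallest RMSE
best_k = min(rmse_by_k, key=rmse_by_k.get)
print(best_k, rmse_by_k[best_k])
```

The differences between neighboring K values are often small, as in the output above, so the selected K can change with a different shuffle or split.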

## 4. Output the results

In [22]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
cols = ['accommodates','bedrooms','bathrooms','beds','minimum_nights','maximum_nights','number_of_reviews']

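# Refit with K = 16, the value with the smallest test error
knn = KNeighborsRegressor(n_neighbors=16)
knn.fit(train_df[cols], train_df['price'])
test_df['four_features_predictions'] = knn.predict(test_df[cols])
four_features_mse = mean_squared_error(test_df['price'], test_df['four_features_predictions'])
# The RMSE is a single number, so this column holds the same value in every row
test_df['four_features_rmse'] = four_features_mse ** (1/2)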

test_df.head()

| | accommodates | bedrooms | bathrooms | beds | price | minimum_nights | maximum_nights | number_of_reviews | four_features_predictions | four_features_rmse |
|---|---|---|---|---|---|---|---|---|---|---|
| 711 | 1.399466 | -0.249501 | -0.439211 | 2.830341 | 0.008408 | -0.341421 | -0.016575 | 0.098972 | 0.016588 | 0.68085 |
| 2291 | 0.900443 | -0.249501 | -0.439211 | 0.297386 | 0.001137 | -0.341421 | -0.016575 | -0.414154 | -0.181549 | 0.68085 |
| 1639 | -0.097602 | -0.249501 | 0.412979 | -0.546933 | -0.536922 | -0.341421 | -0.016575 | -0.414154 | -0.367416 | 0.68085 |
| 847 | -0.596625 | -0.249501 | -0.439211 | -0.546933 | -0.464212 | -0.341421 | -0.016575 | -0.311529 | -0.286979 | 0.68085 |
| 969 | 0.40142 | -0.249501 | -0.439211 | 0.297386 | -0.362417 | -0.065047 | -0.016575 | -0.482571 | 0.314247 | 0.68085 |
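The `price` and prediction columns above are in standardized units. One way to read them as prices again is to keep a scaler fitted on the price column alone and call `inverse_transform`. The original notebook does not keep such a scaler, so `price_scaler` and the prices below are assumptions made for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up nightly prices, one column
prices = np.array([[80.0], [100.0], [120.0], [200.0]])

# Fit a scaler on the price column only, so it can be inverted later
price_scaler = StandardScaler().fit(prices)
scaled = price_scaler.transform(prices)

# inverse_transform maps standardized values back to the original units
restored = price_scaler.inverse_transform(scaled)

# An RMSE in standardized units converts back by multiplying with the price std
rmse_scaled = 0.68085
rmse_in_price_units = rmse_scaled * price_scaler.scale_[0]
print(restored.ravel(), rmse_in_price_units)
```

The converted RMSE printed here depends on the made-up prices, so it says nothing about the real listings; only the conversion itself carries over.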
