<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#前言" data-toc-modified-id="前言-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#实验" data-toc-modified-id="实验-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Experiment</a></span><ul class="toc-item"><li><span><a href="#单特征,-单样本预测-(分步)" data-toc-modified-id="单特征,-单样本预测-(分步)-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Single feature, single-sample prediction (step by step)</a></span></li><li><span><a href="#单特征,-单样本预测-(整合)" data-toc-modified-id="单特征,-单样本预测-(整合)-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Single feature, single-sample prediction (combined)</a></span></li></ul></li><li><span><a href="#实战" data-toc-modified-id="实战-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Practice</a></span><ul class="toc-item"><li><span><a href="#数据清理" data-toc-modified-id="数据清理-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Data cleaning</a></span><ul class="toc-item"><li><span><a href="#数据集信息" data-toc-modified-id="数据集信息-3.1.1"><span class="toc-item-num">3.1.1&nbsp;&nbsp;</span>Dataset information</a></span></li><li><span><a href="#处理空数据" data-toc-modified-id="处理空数据-3.1.2"><span class="toc-item-num">3.1.2&nbsp;&nbsp;</span>Handling missing values</a></span></li><li><span><a href="#归一化数据" data-toc-modified-id="归一化数据-3.1.3"><span class="toc-item-num">3.1.3&nbsp;&nbsp;</span>Normalizing the data</a></span></li></ul></li><li><span><a href="#评估标准-mse和rmse" data-toc-modified-id="评估标准-mse和rmse-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Evaluation metrics: MSE and RMSE</a></span></li><li><span><a href="#Sklearn-预测" data-toc-modified-id="Sklearn-预测-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Prediction with sklearn</a></span></li></ul></li></ul></div>

# Introduction

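KNN is one of the simpler machine learning algorithms. It is a lazy learner: there is no explicit training phase. **The prediction for a sample is the average target value of the k training samples nearest to it.**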

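**Three steps**:
1. Choose a distance metric (for example, Euclidean distance)
1. Use that metric to find the k samples nearest to the one being predicted
1. Use the average target value of those k samples as the prediction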
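
The three steps can be sketched in a few lines of numpy. This is a minimal illustration on made-up data, not the code used later in this notebook; the function name `knn_predict` and the toy arrays are assumptions.

```python
import numpy as np

def knn_predict(train_X, train_y, query, k=3):
    """Predict a target for `query` as the mean target of its k nearest training rows."""
    # Step 1: distance metric (Euclidean distance here).
    distances = np.sqrt(((train_X - query) ** 2).sum(axis=1))
    # Step 2: indices of the k smallest distances.
    nearest = np.argsort(distances, kind="stable")[:k]
    # Step 3: average the targets of those k neighbors.
    return train_y[nearest].mean()

# Hypothetical toy data: two features per row, one target value per row.
train_X = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [10.0, 10.0]])
train_y = np.array([10.0, 20.0, 30.0, 100.0])

prediction = knn_predict(train_X, train_y, np.array([2.0, 2.0]), k=3)
print(prediction)  # 20.0
```

The far-away row at `[10, 10]` does not influence the result because it is not among the three nearest neighbors.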

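**Drawback:**
1. Prediction is computationally expensive, because every query is compared against the stored samples, so KNN is best suited to small datasets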

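The rest of this notebook has two parts. Section 2 uses single-feature, single-sample prediction to explain the principle with pandas and numpy; Section 3 uses multiple features and the sklearn module to run KNN.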

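# Experiment

## Single feature, single-sample prediction (step by step)

Predict the price from the Airbnb data. Single feature: `accommodates`. Single sample: `new_listing = 3`.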

The **distance** metric, where $x$ is a listing's `accommodates` value and $y$ is `new_listing`, is: $$|x-y|$$
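
As a small worked example of this metric, the sketch below computes $|x-y|$ for a few made-up `accommodates` values against a query of 3. The toy Series is an assumption for illustration; the notebook's real data comes from `dc_airbnb.csv`.

```python
import pandas as pd

# Hypothetical accommodates values for five listings.
accommodates = pd.Series([1, 2, 3, 4, 8])
new_listing = 3

# Vectorized absolute difference: same result as applying np.abs row by row.
distance = (accommodates - new_listing).abs()
print(distance.tolist())  # [2, 1, 0, 1, 5]
```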

The cell below counts how often each distance value occurs.

In [25]:
import pandas as pd
import numpy as np

dc_listings = pd.read_csv('dc_airbnb.csv')
new_listing = 3
dc_listings['distance'] = dc_listings['accommodates'].apply(lambda x: np.abs(x - new_listing))
print(dc_listings['distance'].value_counts())

1     2294
2      503
0      461
3      279
5       73
4       35
7       22
6       17
9       12
13       8
8        7
12       6
11       4
10       2
Name: distance, dtype: int64


Shuffle the rows, then sort them by `distance` in ascending order.

In [26]:
import numpy as np
np.random.seed(1)
dc_listings = dc_listings.loc[np.random.permutation(len(dc_listings))]
dc_listings = dc_listings.sort_values('distance')
dc_listings[['price', 'distance']].head(5)

Unnamed: 0,price,distance
577,$185.00,0
2166,$180.00,0
3631,$175.00,0
71,$128.00,0
1011,$115.00,0
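
Many listings share the same distance (461 rows have distance 0 in the counts above), so which five rows end up on top depends on row order. Shuffling first is one way to avoid always picking the rows that happen to come first in the file. The sketch below illustrates this on made-up data; the frame, the seed, and the use of a stable sort (`kind='mergesort'`) are assumptions for the illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical listings: six rows tied at distance 0, two at distance 1.
toy = pd.DataFrame({
    "price": [100.0, 110.0, 120.0, 130.0, 140.0, 150.0, 200.0, 210.0],
    "distance": [0, 0, 0, 0, 0, 0, 1, 1],
})

# Without shuffling, a stable sort keeps tied rows in file order.
unshuffled_top = toy.sort_values("distance", kind="mergesort").head(3)

# With shuffling, the tied rows arrive in random order before the sort.
rng = np.random.default_rng(1)
shuffled = toy.iloc[rng.permutation(len(toy))]
shuffled_top = shuffled.sort_values("distance", kind="mergesort").head(3)

print(unshuffled_top.index.tolist())  # [0, 1, 2]
print(shuffled_top.index.tolist())    # three of the six tied rows, in shuffled order
```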


Use the mean price of the first 5 rows as the predicted price.

In [27]:
# regex=False treats ',' and '$' as literal characters, not regex patterns
stripped_commas = dc_listings['price'].str.replace(',', '', regex=False)
stripped_dollars = stripped_commas.str.replace('$', '', regex=False)
dc_listings['price'] = stripped_dollars.astype('float')
mean_price = dc_listings.iloc[0:5]['price'].mean()
print('The predicted price for 3 accommodates is:', mean_price)

The predicted price for 3 accommodates is: 156.6


## Single feature, single-sample prediction (combined)

The steps above, combined into one function:

In [28]:
# Reload the data and repeat the price cleaning from the previous cells.
dc_listings = pd.read_csv('dc_airbnb.csv')
# regex=False treats ',' and '$' as literal characters, not regex patterns
stripped_commas = dc_listings['price'].str.replace(',', '', regex=False)
stripped_dollars = stripped_commas.str.replace('$', '', regex=False)
dc_listings['price'] = stripped_dollars.astype('float')
dc_listings = dc_listings.loc[np.random.permutation(len(dc_listings))]

def predict_price(new_listing):
    temp_df = dc_listings.copy()
    temp_df['distance'] = temp_df['accommodates'].apply(lambda x: np.abs(x - new_listing))
    temp_df = temp_df.sort_values('distance')
    nearest_neighbors = temp_df.iloc[0:5]['price']
    predicted_price = nearest_neighbors.mean()
    return predicted_price

acc_one = predict_price(1)
acc_two = predict_price(2)
acc_four = predict_price(4)
print('The predicted price for 1 accommodate is:',acc_one)
print('The predicted price for 2 accommodates is:',acc_two)
print('The predicted price for 4 accommodates is:',acc_four)

The predicted price for 1 accommodate is: 71.8
The predicted price for 2 accommodates is: 96.8
The predicted price for 4 accommodates is: 96.0
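
The function above hard-codes five neighbors and reads a global frame. A possible generalization, shown here as a sketch on made-up data, passes the frame and `k` as arguments; the name `predict_price_k` and the toy frame are assumptions, not part of the original notebook.

```python
import pandas as pd

def predict_price_k(listings, new_listing, k=5):
    """Mean price of the k listings whose accommodates value is closest to new_listing."""
    distance = (listings["accommodates"] - new_listing).abs()
    # Stable sort so that ties keep the current row order of `listings`.
    nearest = distance.sort_values(kind="mergesort").index[:k]
    return listings.loc[nearest, "price"].mean()

# Hypothetical listings with already-cleaned numeric prices.
toy = pd.DataFrame({
    "accommodates": [1, 2, 2, 3, 4, 6],
    "price": [50.0, 80.0, 90.0, 120.0, 150.0, 300.0],
})

print(predict_price_k(toy, 2, k=2))  # 85.0
print(predict_price_k(toy, 4, k=3))
```

Because ties are resolved by row order here, shuffling `listings` before the call plays the same role as the `np.random.permutation` step in the notebook.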


# Practice

## Data cleaning

### Dataset information

In [35]:
import pandas as pd
import numpy as np
np.random.seed(1)

dc_listings = pd.read_csv('dc_airbnb.csv')
dc_listings = dc_listings.loc[np.random.permutation(len(dc_listings))]
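# regex=False treats ',' and '$' as literal characters, not regex patterns
stripped_commas = dc_listings['price'].str.replace(',', '', regex=False)
stripped_dollars = stripped_commas.str.replace('$', '', regex=False)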
dc_listings['price'] = stripped_dollars.astype('float')
print(dc_listings.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3723 entries, 574 to 1061
Data columns (total 9 columns):
Unnamed: 0           3723 non-null int64
accommodates         3723 non-null int64
bedrooms             3702 non-null float64
bathrooms            3696 non-null float64
beds                 3712 non-null float64
price                3723 non-null float64
minimum_nights       3723 non-null int64
maximum_nights       3723 non-null int64
number_of_reviews    3723 non-null int64
dtypes: float64(4), int64(5)
memory usage: 290.9 KB
None


### Handling missing values

In [36]:
print('Before:')
print(dc_listings.isnull().sum())
print()
# Drop every row that contains at least one missing value
dc_listings = dc_listings.dropna(axis=0)
print('After:')
print(dc_listings.isnull().sum())

Before:
Unnamed: 0            0
accommodates          0
bedrooms             21
bathrooms            27
beds                 11
price                 0
minimum_nights        0
maximum_nights        0
number_of_reviews     0
dtype: int64

After:
Unnamed: 0           0
accommodates         0
bedrooms             0
bathrooms            0
beds                 0
price                0
minimum_nights       0
maximum_nights       0
number_of_reviews    0
dtype: int64
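
`dropna(axis=0)` removes every row that has at least one missing value, so it is worth checking how many rows are lost. The sketch below shows this on a made-up frame (the values are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values in two columns.
toy = pd.DataFrame({
    "bedrooms": [1.0, np.nan, 2.0, 1.0],
    "bathrooms": [1.0, 1.0, np.nan, 2.0],
    "price": [100.0, 80.0, 150.0, 120.0],
})

print(toy.isnull().sum().to_dict())  # missing values per column
cleaned = toy.dropna(axis=0)         # drop rows containing any NaN
print(len(toy), "->", len(cleaned))  # 4 -> 2
```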


### Normalizing the data

In [37]:
normalized_listings = (dc_listings - dc_listings.mean()) / dc_listings.std()
normalized_listings['price'] = dc_listings['price']
print(normalized_listings.head(3))

      Unnamed: 0  accommodates  bedrooms  bathrooms      beds  price  \
574    -1.203033     -0.596544 -0.249467  -0.439151 -0.546858  125.0   
1593   -0.255599     -0.596544 -0.249467   0.412923 -0.546858   85.0   
3091    1.137193     -1.095499 -0.249467  -1.291226 -0.546858   50.0   

      minimum_nights  maximum_nights  number_of_reviews  
574        -0.341375       -0.016604           4.579650  
1593       -0.341375       -0.016603           1.159275  
3091       -0.341375       -0.016573          -0.482505  
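
The cell above applies z-score standardization, $z = (x - \mu) / \sigma$, to each column so that features measured on different scales contribute comparably to the distance; `price` is then restored because it is the target, not a feature. The sketch below checks on made-up data that each standardized column ends up with mean 0 and standard deviation 1 (pandas' `std()` uses the sample standard deviation, `ddof=1`, by default).

```python
import pandas as pd

# Hypothetical features on very different scales.
toy = pd.DataFrame({
    "accommodates": [1.0, 2.0, 4.0, 6.0],
    "maximum_nights": [30.0, 365.0, 1125.0, 90.0],
})

# Subtract each column's mean and divide by its standard deviation.
standardized = (toy - toy.mean()) / toy.std()

print(standardized.mean().round(6).tolist())
print(standardized.std().round(6).tolist())
```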


## Evaluation metrics: MSE and RMSE

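Mean squared error (MSE) is defined as follows.

Let the true values be $x_1, x_2, \dots, x_n$ and the predicted values be $p_1, p_2, \dots, p_n$. The MSE is:
$$MSE=\frac{\sum_{i=1}^n(x_i-p_i)^2}{n}$$

The root mean squared error (RMSE) is:
$$RMSE=\sqrt{MSE}$$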
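
As a quick worked example of the two formulas, take made-up true values $x = (3, 5, 7)$ and predictions $p = (2, 5, 9)$: the squared errors are $1, 0, 4$, so the MSE is $5/3$ and the RMSE is $\sqrt{5/3}$. The numpy sketch below computes the same thing; the arrays are assumptions for illustration.

```python
import numpy as np

x = np.array([3.0, 5.0, 7.0])  # true values (made up)
p = np.array([2.0, 5.0, 9.0])  # predictions (made up)

mse = np.mean((x - p) ** 2)    # (1 + 0 + 4) / 3
rmse = np.sqrt(mse)

print(mse, rmse)
```

RMSE is in the same unit as the target, which is why it is easier to read than MSE when the target is a price.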

In [40]:
from sklearn.metrics import mean_squared_error

## Prediction with sklearn

In [44]:
from sklearn.neighbors import KNeighborsRegressor

features = ['accommodates', 'bedrooms', 'bathrooms', 'beds', 'minimum_nights', 'maximum_nights', 'number_of_reviews']
# Split into train and test sets: the first 2792 rows train, the rest test
train_df = normalized_listings.iloc[0:2792]
test_df = normalized_listings.iloc[2792:]

knn = KNeighborsRegressor(n_neighbors=5, algorithm='brute')
train_features = train_df[features]
train_target = train_df['price']
test_features = test_df[features]
# Fit
knn.fit(train_features, train_target)
# Predict
all_features_predictions = knn.predict(test_features)
all_features_mse = mean_squared_error(test_df['price'], all_features_predictions)
all_features_rmse = np.power(all_features_mse, 0.5)
print('mse is:',all_features_mse)
print('rmse is:', all_features_rmse)

mse is: 15455.168464163822
rmse is: 124.31881782000592
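
To connect this result with the hand-written version in Section 2, the sketch below checks on made-up data that `KNeighborsRegressor` with its default uniform weights predicts the plain mean of the k nearest targets. The toy arrays are assumptions, chosen so that no two distances tie.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical training data: one feature, one target.
train_X = np.array([[1.0], [2.0], [4.0], [7.0], [11.0]])
train_y = np.array([10.0, 20.0, 40.0, 70.0, 110.0])
query = np.array([[3.4]])

knn = KNeighborsRegressor(n_neighbors=3, algorithm="brute")
knn.fit(train_X, train_y)
sklearn_prediction = knn.predict(query)[0]

# Manual version: mean target of the 3 rows with the smallest |x - query|.
distances = np.abs(train_X[:, 0] - query[0, 0])
manual_prediction = train_y[np.argsort(distances)[:3]].mean()

print(sklearn_prediction, manual_prediction)
```

Both values should agree, since with one feature the default Euclidean distance reduces to the absolute difference used earlier.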
