### Sklearn-KNN

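#### 1. Data preprocessing
- Select suitable features
- Convert feature data to suitable types (e.g., strings to floats, discretizing continuous time values)
- Handle missing values
- Standardize / normalize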

In [4]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Select suitable features
dc_listings = pd.read_csv('listings.csv')
features = ['accommodates','bedrooms','bathrooms','beds','price','minimum_nights','maximum_nights','number_of_reviews']
# Take an explicit copy so later column assignments do not trigger SettingWithCopyWarning
dc_listings = dc_listings[features].copy()

# Convert features to suitable types: strip "$" and "," from price, then cast to float
dc_listings['price'] = dc_listings['price'].str.replace(r'[$,]', '', regex=True).astype(float)

# Handle missing values by dropping incomplete rows
dc_listings = dc_listings.dropna()

# Standardize every column (including the target, price) to zero mean and unit variance
dc_listings[features] = StandardScaler().fit_transform(dc_listings[features])
normalized_listings = dc_listings

# Inspect the result
print(dc_listings.shape)
normalized_listings.head()

(3671, 8)


| | accommodates | bedrooms | bathrooms | beds | price | minimum_nights | maximum_nights | number_of_reviews |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.40142 | -0.249501 | -0.439211 | 0.297386 | 0.081119 | -0.341421 | -0.016575 | -0.516779 |
| 1 | 1.399466 | 2.129508 | 2.969551 | 1.141704 | 1.462622 | -0.065047 | -0.016606 | 1.706767 |
| 2 | -1.095648 | -0.249501 | 1.26517 | -0.546933 | -0.718699 | -0.065047 | -0.016575 | -0.482571 |
| 3 | -0.596625 | -0.249501 | -0.439211 | -0.546933 | -0.391501 | -0.341421 | -0.016575 | -0.516779 |
| 4 | 0.40142 | -0.249501 | -0.439211 | -0.546933 | -0.718699 | 1.316824 | -0.016575 | -0.516779 |
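
To make the standardization step concrete, here is a minimal pure-Python sketch of the z-score that `StandardScaler` applies to each column: subtract the column mean, then divide by the population standard deviation. The `standardize` helper and the sample prices are hypothetical and are not taken from `listings.csv`.

```python
import math

def standardize(values):
    """Return z-scores using the population standard deviation (divide by n)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(variance)
    return [(v - mean) / std for v in values]

# Hypothetical nightly prices, used only for illustration
prices = [100.0, 150.0, 200.0, 250.0, 300.0]
z_scores = standardize(prices)
print([round(z, 4) for z in z_scores])  # [-1.4142, -0.7071, 0.0, 0.7071, 1.4142]
```

After this transformation each column has mean 0 and variance 1, which is why the values in the table above are small numbers centered around zero.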


#### 2. Split into training and test sets

In [5]:
# Positional split: the first 2792 rows (about 76%) for training, the remaining 879 for testing
norm_train_df = normalized_listings.iloc[0:2792].copy()
norm_test_df = normalized_listings.iloc[2792:].copy()

#### 3. Training
- Select the training features
- Instantiate the estimator
- fit
- predict

In [7]:
from sklearn.neighbors import KNeighborsRegressor
# Select the training features
cols = ['accommodates','bedrooms']

# Instantiate the estimator (default n_neighbors=5)
knn = KNeighborsRegressor()

# fit
knn.fit(norm_train_df[cols], norm_train_df['price'])

# predict
two_features_predictions = knn.predict(norm_test_df[cols])
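
As a rough illustration of what `fit` and `predict` do here, the sketch below re-implements the core idea in pure Python: find the k training points closest to the query (Euclidean distance) and average their targets. With its defaults, `KNeighborsRegressor` uses 5 neighbors with uniform weights. The `knn_predict` helper and the sample data are hypothetical, and this sketch ignores details such as tie-breaking and the tree-based neighbor search that scikit-learn may use.

```python
import math

def knn_predict(train_X, train_y, query, k=5):
    """Predict by averaging the targets of the k nearest training points."""
    # Rank training indices by Euclidean distance to the query point
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)

# Hypothetical standardized (accommodates, bedrooms) pairs and standardized prices
train_X = [(-1.0, -0.5), (-0.8, -0.5), (0.0, 0.0), (0.2, 0.1),
           (0.4, 0.0), (1.5, 2.0), (2.0, 2.2)]
train_y = [-0.9, -0.7, -0.1, 0.0, 0.2, 1.4, 1.8]

# The three nearest neighbors of (0.1, 0.0) have targets -0.1, 0.0 and 0.2
prediction = knn_predict(train_X, train_y, query=(0.1, 0.0), k=3)
print(prediction)
```

Because the prediction depends on distances, features on very different scales would dominate the neighbor search, which is the reason for standardizing in step 1.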

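#### 4. Evaluation

Root mean squared error (RMSE). Because `price` was standardized in step 1, the value below is expressed in standard deviations of price, not in dollars.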


In [8]:
from sklearn.metrics import mean_squared_error

two_features_mse = mean_squared_error(norm_test_df['price'], two_features_predictions)
two_features_rmse = two_features_mse ** (1/2)
print(two_features_rmse)

0.8426824704818202
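
For reference, the RMSE computed above is the square root of the mean of the squared differences between true and predicted values. A minimal pure-Python sketch follows; the `rmse` helper and the sample values are hypothetical and chosen so the result is easy to check by hand.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((true - pred)^2))."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Three exact predictions and one that is off by 2:
# squared errors are 0, 0, 0, 4, so the mean is 1 and the root is 1
y_true = [0.0, 1.0, 2.0, 3.0]
y_pred = [0.0, 1.0, 2.0, 5.0]
error = rmse(y_true, y_pred)
print(error)  # 1.0
```

Squaring before averaging means a single large miss weighs more than several small ones, so RMSE is sensitive to outliers in the predictions.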
