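Task:
    Predict the place where a Facebook user checks in

Workflow:
```
1. Load and clean the data
    1.1 Narrow the data range
    1.2 Process time (convert the timestamp to year, month, day, hour, minute, second)
    1.3 Filter out places with few check-ins
2. Split into training and test sets
3. Standardize the data
4. Train the model and predict
5. Tune the model
```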

In [1]:
import pandas as pd
import numpy as np

Read the training and test data

In [2]:
train = pd.read_csv("/data/ys_data/facebook/train.csv")
test = pd.read_csv("/data/ys_data/facebook/test.csv")

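Data cleaning - narrow the data range (only for debugging the algorithm; skip this step when running the optimized algorithm on the full data)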

In [5]:
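# .copy() makes the subset an independent DataFrame, so adding columns later does not trigger SettingWithCopyWarning
train = train.query("x < 2.5 & x > 2 & y < 1.5 & y > 1.0").copy()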

Data cleaning - process the time feature

In [6]:
time_value = pd.to_datetime(train["time"], unit="s")
date = pd.DatetimeIndex(time_value)
train["day"] = date.day
train["weekday"] = date.weekday
train["hour"] = date.hour
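
The conversion above can be checked on a few hand-picked values. This is a minimal sketch, assuming (as the cell above does) that `time` counts seconds from the Unix epoch; the `toy` frame and its values are made up for illustration, and the `.dt` accessor is used here as an alternative to `pd.DatetimeIndex`.

```python
import pandas as pd

# Hypothetical toy data: "time" is assumed to be seconds since the Unix epoch.
toy = pd.DataFrame({"time": [0, 3600, 90000, 470646]})

# Convert to datetimes, then read the calendar parts through the .dt accessor.
stamp = pd.to_datetime(toy["time"], unit="s")
toy = toy.assign(
    day=stamp.dt.day,          # day of the month, 1-31
    weekday=stamp.dt.weekday,  # Monday=0 ... Sunday=6
    hour=stamp.dt.hour,        # hour of the day, 0-23
)
print(toy)
```

Only the day of month, weekday, and hour are kept as features, so two check-ins a month apart on the same weekday and hour look alike to the model.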

Data cleaning - filter out places with few check-ins

In [7]:
place_count = train.groupby("place_id").count()["row_id"]
train_final = train[train["place_id"].isin(place_count[place_count > 3].index.values)]
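
The count-then-filter step can be illustrated on a tiny frame. This is a sketch with made-up `place_id` values; it uses `value_counts()` in place of `groupby(...).count()`, which should give the same per-place counts when `row_id` has no missing values.

```python
import pandas as pd

# Hypothetical check-ins: place 10 has 4 rows, place 20 has 2, place 30 has 1.
toy = pd.DataFrame({
    "row_id": range(7),
    "place_id": [10, 10, 10, 10, 20, 20, 30],
})

# Count check-ins per place, then keep only places with more than 3 check-ins.
counts = toy["place_id"].value_counts()
keep = counts[counts > 3].index
filtered = toy[toy["place_id"].isin(keep)]
print(filtered)
```

Dropping rare places shrinks the number of classes the model has to tell apart, at the cost of never being able to predict the dropped places.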

Select the feature values and the target values

In [8]:
x = train_final[["x", "y", "accuracy", "day", "weekday", "hour"]]
y = train_final["place_id"]

Split into training and test sets

In [9]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=10)
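
No `test_size` is passed in the call above, so scikit-learn falls back to its default of holding out 25% of the rows. A minimal sketch on synthetic arrays (the names `features` and `labels` are illustrative only):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples with 2 features each, and a binary label.
features = np.arange(40).reshape(20, 2)
labels = np.arange(20) % 2

# With no test_size given, 25% of the rows go to the test set.
f_train, f_test, l_train, l_test = train_test_split(
    features, labels, random_state=10
)
print(f_train.shape, f_test.shape)
```

Fixing `random_state` makes the shuffle reproducible, so repeated runs compare models on the same split.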

Standardize the data

In [10]:
from sklearn.preprocessing import StandardScaler

transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)
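
The cell above calls `fit_transform` on the training set but only `transform` on the test set, so the test data is scaled with the training mean and standard deviation. A small sketch with made-up numbers shows the effect:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test features (two columns on different scales).
train_part = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test_part = np.array([[2.0, 40.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_part)  # learns mean/std from training data only
test_scaled = scaler.transform(test_part)        # reuses the training mean/std

print(scaler.mean_)   # per-column means learned from train_part
print(test_scaled)    # test row expressed in training-set units
```

Fitting the scaler on the test set as well would leak information about the test distribution into preprocessing.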

Model tuning

In [11]:
from sklearn.tree import DecisionTreeClassifier

estimator = DecisionTreeClassifier(criterion='gini')
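
In scikit-learn, hyperparameter tuning is commonly done with `GridSearchCV`, and the `best_params_`, `best_score_`, `best_estimator_`, and `cv_results_` attributes printed later in this notebook exist only on such a fitted search object. The following is a minimal sketch, not the notebook's actual configuration: it assumes synthetic data from `make_blobs`, a hypothetical three-value grid over `n_neighbors`, and 3-fold cross-validation.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical data standing in for the check-in features and place labels.
X, y = make_blobs(n_samples=120, centers=3, random_state=0)

# Assumed grid: try three neighbourhood sizes with 3-fold cross-validation.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5]},
    cv=3,
)
search.fit(X, y)

print(search.best_params_)     # the grid point with the highest mean CV score
print(search.best_score_)      # that mean cross-validated accuracy
print(search.best_estimator_)  # estimator refit on all of X, y with the best parameters
```

The same wrapper works for `DecisionTreeClassifier` by swapping the estimator and using a grid over parameters such as `max_depth`.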

Train and evaluate the model

In [12]:
estimator.fit(x_train, y_train)

score = estimator.score(x_test, y_test)
print("Accuracy:\n", score)

Accuracy:
 0.34778524817085227


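Inspect the best parameters, best score, and best estimator. These attributes exist only on a fitted GridSearchCV object; the output below comes from a grid search over n_neighbors for KNeighborsClassifier, not from the DecisionTreeClassifier fitted above.

In [45]:
print("Best parameters:\n", estimator.best_params_)
print("Best score:\n", estimator.best_score_)
print("Best estimator:\n", estimator.best_estimator_)
print("Cross-validation results:\n", estimator.cv_results_)

Best parameters:
 {'n_neighbors': 5}
Best score:
 0.36122741352929555
Best estimator:
 KNeighborsClassifier()
Cross-validation results: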
 {'mean_fit_time': array([0.04645412, 0.04665968, 0.04541981, 0.04796171, 0.0452919 ,
       0.04590189, 0.04634006, 0.04715374, 0.04658577, 0.04746213,
       0.04792147, 0.04532144, 0.04514709, 0.04531186]), 'std_fit_time': array([0.00185437, 0.0015771 , 0.00092431, 0.00769452, 0.00075101,
       0.00079017, 0.0017076 , 0.0016195 , 0.00149691, 0.00207092,
       0.00362087, 0.000864  , 0.00080497, 0.00091073]), 'mean_score_time': array([0.14262388, 0.17309763, 0.18409896, 0.18865788, 0.19136951,
       0.20463109, 0.20438697, 0.21465096, 0.22390065, 0.23209233,
       0.2367444 , 0.23453414, 0.23557563, 0.24891617]), 'std_score_time': array([0.00651548, 0.00537264, 0.01186658, 0.00645341, 0.00250457,
       0.01210409, 0.00225328, 0.00466696, 0.00609704, 0.00775368,
       0.00761838, 0.00428241, 0.00387544, 0.01078873]), 'param_n_neighbors': masked_array(data=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1