## 1. Goal

The goal is to predict the next location a vehicle will reach, given its historical trajectory sequence.

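## 2. Modeling Strategy

Class definition: each distinct spatial location (such as an intersection or a road segment) is treated as one class.

Feature engineering: features are drawn from the vehicle's historical trajectory data, including position coordinates, travel speed, and distance traveled.

Time-series analysis: the temporal order of the trajectory matters, so a model suited to sequential data, such as a recurrent neural network (RNN), may be needed.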
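
Before a sequence model can be trained, each trajectory has to be cut into (history, next point) samples. The following is a minimal sketch of that windowing step, assuming a fixed history length; the helper name `make_windows` and the `window` parameter are illustrative and do not appear in this notebook.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (lon, lat)


def make_windows(
    points: Sequence[Point], window: int
) -> List[Tuple[List[Point], Point]]:
    """Turn one time-ordered trajectory into (history, next_point) pairs."""
    if window < 1:
        raise ValueError("window must be at least 1")
    samples = []
    # Each sample uses `window` consecutive points as input and the point
    # immediately after them as the prediction target.
    for end in range(window, len(points)):
        history = list(points[end - window:end])
        samples.append((history, points[end]))
    return samples


# First four points of trajectory 0 in this notebook's sample data.
trajectory = [
    (116.318726, 40.009014),
    (116.315102, 40.004784),
    (116.315018, 40.002842),
    (116.315041, 39.998585),
]
samples = make_windows(trajectory, window=2)
for history, target in samples:
    print(history, "->", target)
```

Windows should be built per trajectory so that a sample never mixes points from two different trips.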

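## 3. Key Challenges and Approaches

High-dimensional class space: there may be a very large number of distinct locations, which leads to too many classes. One approach is spatial clustering, merging neighboring locations into larger regions.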
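
One simple way to merge nearby locations is to snap coordinates to a regular grid and treat each cell as a class. The sketch below assumes that approach as a stand-in for a proper clustering algorithm; the function name `cell_id` and the 0.01-degree cell size are illustrative choices, not values from this notebook.

```python
import math
from typing import Dict, List, Tuple

Cell = Tuple[int, int]


def cell_id(lon: float, lat: float, cell_deg: float = 0.01) -> Cell:
    """Map a coordinate to the integer index of the grid cell containing it."""
    return (math.floor(lon / cell_deg), math.floor(lat / cell_deg))


# First five points of trajectory 0 in this notebook's sample data.
points = [
    (116.318726, 40.009014),
    (116.315102, 40.004784),
    (116.315018, 40.002842),
    (116.315041, 39.998585),
    (116.315605, 39.992554),
]

# Assign consecutive integer class labels to cells in order of first appearance.
label_of_cell: Dict[Cell, int] = {}
classes: List[int] = [
    label_of_cell.setdefault(cell_id(lon, lat), len(label_of_cell))
    for lon, lat in points
]
print(classes)
```

A larger `cell_deg` gives fewer classes at the cost of coarser predictions, so the cell size is a trade-off to tune against the data.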

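Temporal dependence: trajectory data is sequential, and conventional classifiers may struggle to capture that dependence. One approach is to use a model designed for sequences, such as an RNN or LSTM.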

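Class imbalance: some locations are visited frequently while others are rarely visited. Possible remedies include resampling and class-specific weights.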
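
As an illustration of class-specific weights, the sketch below computes inverse-frequency weights with the formula `n_samples / (n_classes * count)`, which is the same heuristic scikit-learn applies for `class_weight="balanced"`. The helper name `balanced_class_weights` and the toy labels are assumptions made for this example.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def balanced_class_weights(labels: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Weight each class inversely to how often it occurs."""
    counts = Counter(labels)
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * k) for cls, k in counts.items()}


# Toy example: location "A" is visited often, "C" only once.
labels = ["A"] * 6 + ["B"] * 3 + ["C"]
weights = balanced_class_weights(labels)
for cls in sorted(weights):
    print(cls, round(weights[cls], 3))
```

Rare classes receive larger weights, so errors on them count more during training.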

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

In [2]:
traj_df = pd.read_csv('./data/traj.csv',header=0,encoding='utf-8')

In [3]:
traj_df.head()

Unnamed: 0,id,time,entity_id,traj_id,coordinates,current_dis,speeds,holidays
0,0,2013-10-08T17:45:00Z,254,0,"[116.318726,40.009014]",0.0,36.69,0
1,1,2013-10-08T17:46:45Z,254,0,"[116.315102,40.004784]",0.562623,24.5375,0
2,2,2013-10-08T17:47:39Z,254,0,"[116.315018,40.002842]",0.778695,31.9675,0
3,3,2013-10-08T17:49:26Z,254,0,"[116.315041,39.998585]",1.252148,19.785,0
4,4,2013-10-08T17:51:15Z,254,0,"[116.315605,39.992554]",1.924533,24.45,0


In [4]:
# For each trajectory of each vehicle, take the following point as next_coordinates
traj_df['next_coordinates'] = traj_df.groupby(['entity_id', 'traj_id'])['coordinates'].shift(-1)

In [5]:
traj_df.head()

Unnamed: 0,id,time,entity_id,traj_id,coordinates,current_dis,speeds,holidays,next_coordinates
0,0,2013-10-08T17:45:00Z,254,0,"[116.318726,40.009014]",0.0,36.69,0,"[116.315102,40.004784]"
1,1,2013-10-08T17:46:45Z,254,0,"[116.315102,40.004784]",0.562623,24.5375,0,"[116.315018,40.002842]"
2,2,2013-10-08T17:47:39Z,254,0,"[116.315018,40.002842]",0.778695,31.9675,0,"[116.315041,39.998585]"
3,3,2013-10-08T17:49:26Z,254,0,"[116.315041,39.998585]",1.252148,19.785,0,"[116.315605,39.992554]"
4,4,2013-10-08T17:51:15Z,254,0,"[116.315605,39.992554]",1.924533,24.45,0,"[116.315735,39.987846]"


In [6]:
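# Drop rows with missing values; this removes the last point of each
# trajectory, which has no next point
traj_df = traj_df.dropna()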

In [7]:
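import ast

# Work on an explicit copy so the column assignments below do not raise
# SettingWithCopyWarning, instead of silencing all warnings.
traj_df = traj_df.copy()

# ast.literal_eval parses the "[lon,lat]" strings safely: unlike eval,
# it accepts only Python literals.
traj_df['start_lon'] = traj_df['coordinates'].apply(lambda x: ast.literal_eval(x)[0])
traj_df['start_lat'] = traj_df['coordinates'].apply(lambda x: ast.literal_eval(x)[1])
traj_df['next_lon'] = traj_df['next_coordinates'].apply(lambda x: ast.literal_eval(x)[0])
traj_df['next_lat'] = traj_df['next_coordinates'].apply(lambda x: ast.literal_eval(x)[1])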

In [8]:
traj_df.head()

Unnamed: 0,id,time,entity_id,traj_id,coordinates,current_dis,speeds,holidays,next_coordinates,start_lon,start_lat,next_lon,next_lat
0,0,2013-10-08T17:45:00Z,254,0,"[116.318726,40.009014]",0.0,36.69,0,"[116.315102,40.004784]",116.318726,40.009014,116.315102,40.004784
1,1,2013-10-08T17:46:45Z,254,0,"[116.315102,40.004784]",0.562623,24.5375,0,"[116.315018,40.002842]",116.315102,40.004784,116.315018,40.002842
2,2,2013-10-08T17:47:39Z,254,0,"[116.315018,40.002842]",0.778695,31.9675,0,"[116.315041,39.998585]",116.315018,40.002842,116.315041,39.998585
3,3,2013-10-08T17:49:26Z,254,0,"[116.315041,39.998585]",1.252148,19.785,0,"[116.315605,39.992554]",116.315041,39.998585,116.315605,39.992554
4,4,2013-10-08T17:51:15Z,254,0,"[116.315605,39.992554]",1.924533,24.45,0,"[116.315735,39.987846]",116.315605,39.992554,116.315735,39.987846


In [9]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

features = ['traj_id', 'current_dis', 'speeds', 'holidays', 'start_lon', 'start_lat']
X = traj_df[features]

# Targets: longitude and latitude of the next point. Note that this is
# regression on raw coordinates, not the class-based formulation of Section 2.
y = traj_df[['next_lon', 'next_lat']]

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
model = RandomForestRegressor()
model.fit(X_train, y_train)

# Predict on the test set
predictions = model.predict(X_test)

# Evaluate: per-coordinate RMSE, in degrees
mse = mean_squared_error(y_test, predictions, multioutput='raw_values')
rmse = np.sqrt(mse)
print(f'RMSE for Longitude: {rmse[0]}, RMSE for Latitude: {rmse[1]}')


RMSE for Longitude: 0.0029241525590863013, RMSE for Latitude: 0.0021496868952790398
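
RMSE values in degrees are hard to interpret directly. The sketch below converts a coordinate offset into metres with the haversine formula, assuming a spherical Earth with a mean radius of 6,371 km; the function name `haversine_m` and the choice of the first sample point as the reference location are illustrative. Near this latitude, the two RMSE values above correspond to roughly 240 to 250 m in each direction. This is only an approximate reading, since per-axis RMSE is not the same as the RMSE of the point-to-point distance.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; a spherical approximation


def haversine_m(lon1: float, lat1: float, lon2: float, lat2: float) -> float:
    """Great-circle distance in metres between two (lon, lat) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Reference location: the first point in the sample data.
ref_lon, ref_lat = 116.318726, 40.009014
# Per-coordinate RMSE values reported above, in degrees.
rmse_lon, rmse_lat = 0.0029241525590863013, 0.0021496868952790398

lon_err_m = haversine_m(ref_lon, ref_lat, ref_lon + rmse_lon, ref_lat)
lat_err_m = haversine_m(ref_lon, ref_lat, ref_lon, ref_lat + rmse_lat)
print(f"Longitude RMSE is about {lon_err_m:.0f} m, latitude RMSE about {lat_err_m:.0f} m")
```

The same function can be applied row by row to predicted and actual coordinates to report a single distance error per prediction.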


In [10]:
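# Attach the ground-truth next coordinates to the test features
# (these are the labels, not the model's predictions)
X_test = X_test.assign(next_lon=y_test['next_lon'], next_lat=y_test['next_lat'])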

In [11]:
X_test.to_csv('test_4.csv',encoding='utf-8',index=False)
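
The row-level random split in cell In [9] can place points from the same trajectory in both the training and the test set, which may make the reported error optimistic. The following is a minimal sketch of splitting by whole trajectories instead, assuming each row is identified by its `(entity_id, traj_id)` pair. The helper `split_by_group` is hypothetical, and all group keys in the toy data other than `(254, 0)` are made up for illustration.

```python
import random
from typing import Hashable, List, Sequence, Tuple


def split_by_group(
    groups: Sequence[Hashable], test_size: float = 0.2, seed: int = 42
) -> Tuple[List[int], List[int]]:
    """Return (train_idx, test_idx) so that no group appears on both sides."""
    unique = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique)
    # Hold out at least one whole group for testing.
    n_test = max(1, round(len(unique) * test_size))
    test_groups = set(unique[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


# Toy group keys, one (entity_id, traj_id) pair per row.
groups = (
    [(254, 0)] * 3 + [(254, 1)] * 2 + [(301, 0)] * 4 + [(301, 1)] + [(17, 0)] * 2
)
train_idx, test_idx = split_by_group(groups, test_size=0.2)
print(len(train_idx), "training rows,", len(test_idx), "test rows")
```

With pandas, the returned positions could be passed to `traj_df.iloc[...]`; scikit-learn's `GroupShuffleSplit` offers a comparable group-aware split.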