# Modeling

`Author: YUAN Yanzhe`

- In this notebook, XGBoost is chosen as the regressor to model the data, and grid search is used to tune its hyperparameters.
    - Time series prediction is usually approached with one of the following:
        - DL models such as LSTM
        - statistical models such as ARMA/ARIMA
        - traditional ML models such as linear models, XGBoost, etc.
    - After looking at the data, I found no obvious temporal trend in the traffic speed, so I think a statistical model like ARMA, which is built on the idea of moving averages, may not perform well on this task.
    - What's more, many features can be mined and extracted from 'date' and from additional weather resources, and they may influence the traffic speed. So the task can be treated as a feature-engineering-driven task, i.e. the more useful and relevant features we get, the better the regressor may predict.
    - It also turns out that XGBoost performs much better than these models (judging from other competitors' results).
    - This is why XGBoost is chosen as the model for this task.

- The code was originally run in Google Colab.

- The best result is in `xgboost_submit_final.csv`

- Possible improvement: forecastxgb?

- Due to the upload limit, other experiments with dummy-encoded features were done under the account Jackbighead. However, the results from this notebook (team_name: youngandcold) are relatively better, so I use this one as the final .ipynb file.
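To illustrate the point above about mining features from 'date', here is a minimal sketch of how a timestamp can be split into the calendar columns used later (`hour`, `month`, `day`, `year`, `weekday`). The helper name `date_features` and the timestamp format are assumptions for illustration; the raw date format of the competition data is not shown in this notebook.

```python
from datetime import datetime


def date_features(timestamp: str, fmt: str = "%Y-%m-%d %H:%M") -> dict:
    """Split a timestamp string into calendar features.

    The default format is an assumption, not the format of the original data.
    """
    dt = datetime.strptime(timestamp, fmt)
    return {
        "hour": dt.hour,
        "month": dt.month,
        "day": dt.day,
        "year": dt.year,
        "weekday": dt.weekday(),  # Monday = 0 ... Sunday = 6
    }


print(date_features("2018-01-01 02:00"))
# {'hour': 2, 'month': 1, 'day': 1, 'year': 2018, 'weekday': 0}
```

With this encoding, 1 January 2018 (a Monday) gets `weekday = 0`, which agrees with the first rows of the test data shown in the next section.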

## Load Data

In [29]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
from sklearn import neighbors
from sklearn.utils import shuffle
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.model_selection import cross_val_score, GridSearchCV
import xgboost as xgb
from xgboost import XGBRegressor

from sklearn import datasets
from sklearn import metrics
import pandas as pd

In [30]:
# read data
all_data = pd.read_csv("/content/drive/My Drive/5001_kaggle/train_cleaned_data7.csv")
sub_data = pd.read_csv("/content/drive/My Drive/5001_kaggle/test_cleaned_data7.csv")
sub_form = pd.read_csv("/content/drive/My Drive/5001_kaggle/sampleSubmission.csv")
sub_data

Unnamed: 0,id,hour,month,day,year,weekday,holiday,speed,tempC,visibility,winddirDegree,windspeedKmph,humidity,cloudcover,WindChillC
0,0,2,1,1,2018,0,1,0,19,10,65,12,63,23,18
1,1,5,1,1,2018,0,1,0,19,10,65,12,63,23,18
2,2,7,1,1,2018,0,1,0,19,10,65,12,63,23,18
3,3,8,1,1,2018,0,1,0,19,10,65,12,63,23,18
4,4,10,1,1,2018,0,1,0,19,10,65,12,63,23,18
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3499,3499,17,12,31,2018,0,0,0,12,10,138,18,69,79,10
3500,3500,19,12,31,2018,0,0,0,12,10,138,18,69,79,10
3501,3501,21,12,31,2018,0,0,0,12,10,138,18,69,79,10
3502,3502,22,12,31,2018,0,0,0,12,10,138,18,69,79,10


In [31]:
# choose features
X=all_data[['holiday','hour','month','day','year','weekday','tempC','visibility','winddirDegree','windspeedKmph','humidity','cloudcover','WindChillC']]
x=sub_data[['holiday','hour','month','day','year','weekday','tempC','visibility','winddirDegree','windspeedKmph','humidity','cloudcover','WindChillC']]
y=all_data[['speed']]

In [32]:
# Split the training data; test_size=0.01 was used only for the last submission.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.01, random_state=1)
print(len(X_train))
print(len(X_test))

13865
141
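The two sizes printed above can be checked by hand. The sketch below assumes that, for a fractional `test_size`, `train_test_split` rounds the test count up and gives the remainder to the training set; the helper `split_sizes` is hypothetical and only reproduces that arithmetic.

```python
import math


def split_sizes(n_samples: int, test_size: float) -> tuple:
    """Return (n_train, n_test) for a fractional test_size.

    Assumes the test count is rounded up and the rest goes to training.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


n_samples = 13865 + 141  # total rows implied by the printed sizes
print(split_sizes(n_samples, 0.01))  # (13865, 141)
```

With only 141 held-out rows, the hold-out error is a rough sanity check rather than a precise estimate.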


## XGBoost Fine-Tuning

- Best Parameters:
    - `n_estimators`: 700
    - `max_depth`: 4
    - `min_child_weight`: 1
    - `gamma`: 0.6
    - `colsample_bytree`: 0.6
    - `subsample`: 0.9
    - `reg_alpha`: 3
    - `reg_lambda`: 3
    - `learning_rate`: 0.1
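As a sketch, the best parameters listed above could be collected in one dictionary and passed to `XGBRegressor`. The helper `build_best_model` and the `seed` default are assumptions for illustration; note that the fit in the following cell uses a different set of values.

```python
# Values copied from the "Best Parameters" list; the keys are the
# parameter names of xgboost's scikit-learn wrapper.
best_params = {
    "n_estimators": 700,
    "max_depth": 4,
    "min_child_weight": 1,
    "gamma": 0.6,
    "colsample_bytree": 0.6,
    "subsample": 0.9,
    "reg_alpha": 3,
    "reg_lambda": 3,
    "learning_rate": 0.1,
}


def build_best_model(seed=0):
    """Hypothetical helper: build an unfitted regressor from best_params."""
    # Imported here so the dictionary can be inspected without xgboost installed.
    from xgboost import XGBRegressor

    return XGBRegressor(random_state=seed, **best_params)
```

Usage would then be `model = build_best_model()` followed by `model.fit(X_train, y_train)`.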

In [84]:
#model = xgb.XGBRegressor(learning_rate=0.08, n_estimators=330, max_depth=6, min_child_weight=7, seed=0,
                             #subsample=0.8, colsample_bytree=0.85, gamma=0.8, reg_alpha=13, reg_lambda=3, base_score=0.7)
model = xgb.XGBRegressor(learning_rate=0.08, n_estimators=600, max_depth=6,
                             subsample=0.7,base_score=0.7,gamma=0.85)
model.fit(X_train, y_train)



XGBRegressor(base_score=0.7, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0.85,
             importance_type='gain', learning_rate=0.08, max_delta_step=0,
             max_depth=6, min_child_weight=1, missing=None, n_estimators=600,
             n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
             silent=None, subsample=0.7, verbosity=1)

In [None]:
cv_params = {'max_depth': [3, 4, 5, 6, 7, 8, 9, 10], 'min_child_weight': [1, 2, 3, 4, 5, 6]}
other_params = {'learning_rate': 0.1, 'n_estimators': 700, 'max_depth': 5, 'min_child_weight': 1, 'seed': 0,
          'subsample': 0.8, 'colsample_bytree': 0.8, 'gamma': 0, 'reg_alpha': 0, 'reg_lambda': 1}


# Use a separate name so the fitted `model` from the cell above is not overwritten.
grid_model = xgb.XGBRegressor(**other_params)
optimized_GBM = GridSearchCV(estimator=grid_model, param_grid=cv_params, scoring='r2', cv=5, verbose=1, n_jobs=4)
optimized_GBM.fit(X_train, y_train)
#evaluate_result = optimized_GBM.cv_results_
#print('Results of each iteration: {0}'.format(evaluate_result))
print('Best parameter values: {0}'.format(optimized_GBM.best_params_))
print('Best model score: {0}'.format(optimized_GBM.best_score_))

Fitting 5 folds for each of 48 candidates, totalling 240 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:  3.1min
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed: 22.0min
[Parallel(n_jobs=4)]: Done 240 out of 240 | elapsed: 31.0min finished


Best parameter values: {'max_depth': 4, 'min_child_weight': 1}
Best model score: 0.9029020398853819
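The "48 candidates" and "240 fits" in the log above follow directly from the grid: every combination of the listed values is tried, and each combination is fitted once per fold. The sketch below only reproduces that counting with the standard library; it is not how `GridSearchCV` is implemented internally.

```python
from itertools import product

# Same grid as in the cell above.
cv_params = {
    "max_depth": [3, 4, 5, 6, 7, 8, 9, 10],
    "min_child_weight": [1, 2, 3, 4, 5, 6],
}
n_folds = 5

# Enumerate every combination of parameter values.
names = list(cv_params)
candidates = [dict(zip(names, values)) for values in product(*cv_params.values())]

total_fits = len(candidates) * n_folds
print(len(candidates), total_fits)  # 48 240
```

Since the run took about 31 minutes, tuning two parameters at a time (as done here) keeps the grid small compared with searching all parameters jointly.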


In [85]:
y_pred = model.predict(X_test)
MSE = metrics.mean_squared_error(y_test, y_pred)
RMSE = np.sqrt(MSE)

print('MSE:',MSE)
print('RMSE:',RMSE)

MSE: 7.922532365631455
RMSE: 2.814699338407471
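For reference, the two metrics above can be written out by hand: MSE is the mean of the squared errors, and RMSE is its square root, which puts the error back in the same unit as the speed. The sketch below uses made-up numbers, not values from the data set, and the helper names `mse` and `rmse` are chosen for illustration.

```python
import math


def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Square root of the MSE, in the same unit as the target."""
    return math.sqrt(mse(y_true, y_pred))


# Made-up speeds for illustration only.
y_true = [40.0, 45.0, 50.0, 30.0]
y_pred = [43.0, 41.0, 50.0, 30.0]

print(mse(y_true, y_pred))   # 6.25
print(rmse(y_true, y_pred))  # 2.5
```

The same relation holds for the reported numbers: the square root of the MSE of 7.92 is the RMSE of about 2.81.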


Generate the final submission CSV file.

In [77]:
#index=[0,2]
#sub_features = sub_data.drop(sub_data.columns[index], axis=1)
#sub_features
sub_speed = model.predict(x)
sub_speed = pd.DataFrame(sub_speed)
print(sub_speed)

              0
0     48.220631
1     48.280731
2     39.437141
3     30.592541
4     39.120834
...         ...
3499  12.087507
3500  24.505507
3501  48.373161
3502  41.343384
3503  44.407139

[3504 rows x 1 columns]


In [78]:
sub_form["speed"] = sub_speed
print(sub_form)

        id      speed
0        0  48.220631
1        1  48.280731
2        2  39.437141
3        3  30.592541
4        4  39.120834
...    ...        ...
3499  3499  12.087507
3500  3500  24.505507
3501  3501  48.373161
3502  3502  41.343384
3503  3503  44.407139

[3504 rows x 2 columns]


In [79]:
sub_form.to_csv("/content/drive/My Drive/5001_kaggle/xgboost_submit_4.csv", index=False)