# Overview

Stage 3 is similar to Stage 2: we again fit an ensemble model on the training data to predict the duration and trajlength of the test data. The difference is that we now combine the training features with trajlength to fit a model that predicts duration, and combine the training features with duration to fit a model that predicts trajlength. As the test data carries no information about trajlength or duration, we use our Stage 2 predictions as those values. I tried fitting the models both with the Stage 2 predicted training duration and trajlength values and with the true training values. Unfortunately, both sets of values produce similar results.

I believe that trajlength is crucial for predicting duration, and vice versa. Adding one as a covariate lets the model for the other learn from it. This improves the RMSPE of our Kaggle submission significantly.
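
The cross-feeding idea can be sketched on toy data. This is a minimal illustration, not the notebook's pipeline: the arrays `X`, `dur_pred` and `traj_pred` are made up, and dense NumPy arrays stand in for the sparse Stage 0 matrices.

```python
import numpy as np

# Toy stand-ins (hypothetical values): 3 rides, 2 engineered features.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
dur_pred = np.array([5.1, 6.3, 5.8])   # Stage 2 style predictions of duration
traj_pred = np.array([7.9, 8.4, 8.1])  # Stage 2 style predictions of trajlength

# Inputs for the duration model: features + predicted trajlength.
X_for_dur = np.hstack([X, traj_pred.reshape(-1, 1)])
# Inputs for the trajlength model: features + predicted duration.
X_for_traj = np.hstack([X, dur_pred.reshape(-1, 1)])

print(X_for_dur.shape)
print(X_for_traj.shape)
```

Each model therefore sees one extra column, which is why the feature count grows from 2701 to 2702 later in this stage.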

In [54]:
from sklearn.externals import joblib
from sklearn.model_selection import (train_test_split,
                                     GridSearchCV, cross_val_score)
from sklearn.ensemble import (RandomForestRegressor,
                              GradientBoostingRegressor)
from sklearn.linear_model import LassoCV, ElasticNetCV
from sklearn.kernel_ridge import KernelRidge
import xgboost as xgb
from scipy import sparse
import numpy as np
import pandas as pd
from keras.models import load_model
from keras.layers import (Input, Dense,
                          BatchNormalization, Dropout, Activation)
from keras.models import Model
from keras import backend as K
from scipy.sparse import coo_matrix, hstack

As in Stage 2, we load the train and test data produced by the feature engineering in Stage 0.

In [2]:
sX_train_stage0 = joblib.load('sX_train_stage0.pkl')
sX_test_stage0 = joblib.load('sX_test_stage0.pkl')
print sX_train_stage0.shape
print sX_test_stage0.shape

(465172, 2701)
(465172, 2701)


We will also load the true values of the trajlength and duration from the training data

In [3]:
Y_train_price = joblib.load('Y_train_price.pkl')
Y_train_duration = joblib.load('Y_train_duration.pkl')
Y_train_trajlength = joblib.load('Y_train_trajlength.pkl')

We also load the Stage 2 predictions of duration and trajlength for the training data, produced by the Ensemble Model (a Lasso model fitted on the predictions of the Random Forest and XGBoost models).

In [4]:
Y_train_dur_pred = \
joblib.load('Y_train_dur_pred_stage2v2.pkl')
Y_train_traj_pred = \
joblib.load('Y_train_traj_pred_stage2v2.pkl')

In [5]:
print sX_train_stage0.shape
print sX_test_stage0.shape
print Y_train_price.shape
print Y_train_duration.shape
print Y_train_trajlength.shape

(465172, 2701)
(465172, 2701)
(465172,)
(465172,)
(465172,)


Lastly, we remove all training rows that were flagged as outliers in Stage 1.

In [6]:
non_outlier_index_stage1 = \
np.array(joblib.load('non_outlier_index_stage1.pkl'))
print non_outlier_index_stage1.shape
print non_outlier_index_stage1

(464151,)
[     0      1      2 ... 465169 465170 465171]


We can then create the new features to be fitted by the ensemble models in this stage. For the ensemble model that predicts duration, we use the original Stage 0 features plus the predicted trajlength from Stage 2. For the ensemble model that predicts trajlength, we use the original Stage 0 features plus the predicted duration from Stage 2. Note the naming in the code: `sX_train_dur` carries the duration column and is therefore the input of the trajlength model, while `sX_train_traj` carries the trajlength column and is the input of the duration model.

In [7]:
sX_train_stage0 = sX_train_stage0[non_outlier_index_stage1]

sX_train_stage0_duration = \
hstack([sX_train_stage0, Y_train_dur_pred.reshape(-1, 1)])
sX_train_stage0_trajlength = \
hstack([sX_train_stage0, Y_train_traj_pred.reshape(-1, 1)])

sX_train_dur = sparse.csc_matrix(sX_train_stage0_duration)
sX_train_traj = sparse.csc_matrix(sX_train_stage0_trajlength)

In [8]:
Y_dur_stage2 = joblib.load('Y_dur_stage4.pkl')
Y_traj_stage2 = joblib.load('Y_traj_stage4.pkl')
Y_dur_stage2 = Y_dur_stage2.reshape(-1, 1)
Y_traj_stage2 = Y_traj_stage2.reshape(-1, 1)

sX_test_dur = hstack((sX_test_stage0, Y_dur_stage2))
sX_test_traj = hstack((sX_test_stage0, Y_traj_stage2))
sX_test_dur = sparse.csc_matrix(sX_test_dur)
sX_test_traj = sparse.csc_matrix(sX_test_traj)

In [9]:
Y_train_pri = Y_train_price[non_outlier_index_stage1]
Y_train_dur = Y_train_duration[non_outlier_index_stage1]
Y_train_traj = Y_train_trajlength[non_outlier_index_stage1]

In [10]:
n_train = sX_train_dur.shape[0]
n_test = sX_test_dur.shape[0]

In [11]:
dtrain_dur = xgb.DMatrix(sX_train_traj, label = Y_train_dur)
dtrain_traj = xgb.DMatrix(sX_train_dur, label = Y_train_traj)
dtest_dur = xgb.DMatrix(sX_test_traj)
dtest_traj = xgb.DMatrix(sX_test_dur)

In [12]:
print Y_train_pri.shape
print Y_train_traj.shape
print Y_train_dur.shape

(464151,)
(464151,)
(464151,)


In [13]:
print sX_train_dur.shape
print sX_train_traj.shape
print sX_test_dur.shape
print sX_test_traj.shape

(464151, 2702)
(464151, 2702)
(465172, 2702)
(465172, 2702)


As in Stage 2, I perform a train-validation split on the training data to choose appropriate parameters for my Ensemble Model. I fit each candidate parameter setting on 80% of the training data and check its performance on the remaining 20% by computing the RMSPE of the predictions against the true values.

In [14]:
idx_fit, idx_val = \
train_test_split(np.arange(n_train), test_size = 0.20)

In [15]:
sX_fit_dur = sX_train_dur[idx_fit]
sX_fit_traj = sX_train_traj[idx_fit]

sX_val_dur = sX_train_dur[idx_val]
sX_val_traj = sX_train_traj[idx_val]

Y_fit_pri = Y_train_pri[idx_fit]
Y_fit_dur = Y_train_dur[idx_fit]
Y_fit_traj = Y_train_traj[idx_fit]

Y_val_pri = Y_train_pri[idx_val]
Y_val_dur = Y_train_dur[idx_val]
Y_val_traj = Y_train_traj[idx_val]

dfit_dur = xgb.DMatrix(sX_fit_traj, label = Y_fit_dur)
dfit_traj = xgb.DMatrix(sX_fit_dur, label = Y_fit_traj)
dval_dur = xgb.DMatrix(sX_val_traj, label = Y_val_dur)
dval_traj = xgb.DMatrix(sX_val_dur, label = Y_val_traj)

In [16]:
print sX_fit_dur.shape
print sX_fit_traj.shape
print sX_val_dur.shape
print sX_val_traj.shape
print Y_fit_pri.shape
print Y_fit_dur.shape
print Y_fit_traj.shape
print Y_val_pri.shape
print Y_val_dur.shape
print Y_val_traj.shape

(371320, 2702)
(371320, 2702)
(92831, 2702)
(92831, 2702)
(371320,)
(371320,)
(371320,)
(92831,)
(92831,)
(92831,)
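
The shapes printed above are consistent with the 80/20 split. As far as I understand `train_test_split`, a fractional `test_size` is rounded up to a whole number of rows and the remainder goes to the fitting set. The sketch below only redoes that arithmetic for the 464151 non-outlier rows and does not call scikit-learn.

```python
import math

n_train = 464151          # rows left after removing the Stage 1 outliers
test_size = 0.20

n_val = int(math.ceil(test_size * n_train))  # validation rows (rounded up)
n_fit = n_train - n_val                      # rows used for fitting

print(n_fit)
print(n_val)
```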


In [17]:
from sklearn.metrics import make_scorer

def rmpse_loss_func(ground_truth, predictions):
    # RMSPE: root mean squared percentage error
    ratio = np.true_divide(predictions, ground_truth)
    return np.sqrt(np.mean((ratio - 1.) ** 2))

# sklearn scorer; a lower RMSPE is better
rmpse_loss = make_scorer(rmpse_loss_func,
                         greater_is_better=False)

def rmpse(preds, dtrain):
    # Custom evaluation metric for xgb.train (feval)
    labels = dtrain.get_label()
    err = rmpse_loss_func(labels, preds)
    return 'error', err
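
As a quick sanity check of the metric, here is a small worked example. The numbers are made up: every prediction is off by exactly 10%, so the RMSPE should come out very close to 0.1. The helper `rmspe` below is a stand-alone copy of the formula, written only for this illustration.

```python
import numpy as np

def rmspe(ground_truth, predictions):
    # Root mean squared percentage error: sqrt(mean((pred / truth - 1)^2))
    ratio = np.true_divide(predictions, ground_truth)
    return np.sqrt(np.mean((ratio - 1.0) ** 2))

truth = np.array([100.0, 200.0, 400.0])
preds = np.array([110.0, 180.0, 440.0])  # +10%, -10%, +10%

err = rmspe(truth, preds)
print(round(err, 6))
```

Because the error is relative, a miss of 10 on a true value of 100 counts the same as a miss of 40 on a true value of 400.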

In [18]:
from sklearn.model_selection import KFold
kfold = KFold(n_splits = 3, shuffle = True, random_state=1234)

## Duration

To predict duration in Stage 3, I build an ensemble that uses the predictions from

- Random Forest
- XGBoost
- Elastic Net
- Lasso

as covariates for the Ensemble Lasso Model. We then run the test dataset through each base model and use the results as covariates for the ensemble model to produce the final prediction for the duration.

The covariates for each base model are the features engineered from the training data in Stage 0 plus the predicted trajectory values from Stage 2.

First, I perform a grid search on XGBoost over `max_depth`, `min_child_weight`, `gamma`, and `colsample_bytree`. From this experiment, the two most important parameters to tune are `max_depth` and `gamma`. Without a positive `gamma`, the trees overfit easily: the loss keeps decreasing on the training data but not on the validation data.

In [None]:
#i = 0
#bst_dur_dict = {}
#for max_depth in [15, 20, 25, 30, 40]:
#    for min_child_weight in [1, 3, 5]:
#        for gamma in [0, 1]:
#            for colsample_bytree in [0.7, 1.0]:
#                print "idx:"+str(i)
#                param = { 'objective' : "reg:linear", 
#                          'booster' : "gbtree",
#                          'eta'                 :0.03, 
#                          'max_depth'           :max_depth, 
#                          'colsample_bytree'    :colsample_bytree,
#                          'min_child_weight' : min_child_weight,
#                          'gamma' : gamma,
#                         'subsample' : 0.7,
#                          'nthread' : 8
#                        }
#                bst_dur = xgb.train(param, dfit_dur, \
#evals=[(dfit_dur, 'fit'), (dval_dur, 'val')], num_boost_round = 500,
#feval= rmpse, maximize = False)
#                rmpse_val = rmpse_loss_func(Y_val_dur , bst_dur.predict(dval_dur))
#                bst_dur_dict[i] = {
#                    'max_depth' : max_depth,
#                    'min_child_weight' : min_child_weight,
#                    'gamma' : gamma,
#                    'colsample_bytree' : colsample_bytree,
#                    'rmpse' : rmpse_val
#                }
#                i = i + 1
#joblib.dump(bst_dur_dict, 'bst_dur_dict.pkl')
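
The nested loops above visit 5 x 3 x 2 x 2 = 60 parameter combinations. A compact way to enumerate the same grid, in the same order, is `itertools.product`. The sketch below only builds the list of settings and does not train any model; the names `grid` and `settings` are illustrative.

```python
import itertools

grid = {
    'max_depth': [15, 20, 25, 30, 40],
    'min_child_weight': [1, 3, 5],
    'gamma': [0, 1],
    'colsample_bytree': [0.7, 1.0],
}
names = ['max_depth', 'min_child_weight', 'gamma', 'colsample_bytree']

# product() varies the last list fastest, like the innermost for-loop.
settings = [dict(zip(names, combo))
            for combo in itertools.product(*[grid[n] for n in names])]

print(len(settings))
print(settings[0])
print(settings[57])
```

The position of a setting in this list matches the index `i` used as the key of `bst_dur_dict`, which makes it easy to map a column of the results table back to its parameters.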

In [26]:
bst_dur_dict = joblib.load('bst_dur_dict.pkl')
pd.DataFrame(bst_dur_dict)

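| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| colsample_bytree | 0.7 | 1.0 | 0.7 | 1.0 | 0.7 | 1.0 | 0.7 | 1.0 | 0.7 | 1.0 | ... | 0.7 | 1.0 | 0.7 | 1.0 | 0.7 | 1.0 | 0.7 | 1.0 | 0.7 | 1.0 |
| gamma | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | ... | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| max_depth | 15.0 | 15.0 | 15.0 | 15.0 | 15.0 | 15.0 | 15.0 | 15.0 | 15.0 | 15.0 | ... | 40.0 | 40.0 | 40.0 | 40.0 | 40.0 | 40.0 | 40.0 | 40.0 | 40.0 | 40.0 |
| min_child_weight | 1.0 | 1.0 | 1.0 | 1.0 | 3.0 | 3.0 | 3.0 | 3.0 | 5.0 | 5.0 | ... | 1.0 | 1.0 | 3.0 | 3.0 | 3.0 | 3.0 | 5.0 | 5.0 | 5.0 | 5.0 |
| rmpse | 0.029942 | 0.030059 | 0.031755 | 0.031766 | 0.029852 | 0.030116 | 0.031733 | 0.031773 | 0.029868 | 0.030062 | ... | 0.03108 | 0.030923 | 0.030383 | 0.030026 | 0.03107 | 0.030889 | 0.030055 | 0.029804 | 0.031078 | 0.030907 |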


Based on the grid search, I decided to use the following values for the XGBoost duration model. I choose deep trees, with a positive `gamma` and a `min_child_weight` above the default. I believe this model is complex enough to learn the duration of each taxi ride from the available information, while the regularization parameters `min_child_weight` and `gamma` keep it from overfitting the training data.

In [29]:
param = { 'objective' : "reg:linear",
          'booster' : "gbtree",
          'eta' : 0.05,
          'max_depth' : 30,
          'colsample_bytree' : 0.7,
          'min_child_weight' : 3,
          'gamma' : 1,
          'subsample' : 0.7,
          'nthread' : 8  # XGBoost's thread-count key is `nthread`
        }

In [None]:
bst_dur = xgb.train(param, dtrain_dur, evals=[(dtrain_dur, 'train')], 
                num_boost_round = 2000, feval= rmpse, maximize = False)

For the Random Forest Regressor, I found that the most crucial parameter to tune is `max_features`, as it sets the number of features considered at each split. Because we one-hot encode, our matrix is very sparse. If we use a small setting such as `log2` or `sqrt`, the candidates at a split may consist mostly of uninformative features (columns whose values are mostly zero), and the model will be heavily biased. The max depth used in this stage is quite small because my computer could not accommodate deeper trees.

In [None]:
rf_dur = RandomForestRegressor(max_depth = 12, 
                               max_features = 0.3, 
                               n_estimators= 1000, 
                               verbose = 10, n_jobs = -1, 
                               criterion='mse', 
                               oob_score = True)\
.fit(sX_train_traj, Y_train_dur)
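
To make the effect of `max_features` concrete, the sketch below estimates how many of the 2702 columns are candidates at each split under three settings. It assumes the usual scikit-learn conventions (`sqrt` and `log2` of the feature count, and a float treated as a fraction, each truncated to an integer); it is plain arithmetic and does not touch the forest itself.

```python
import math

n_features = 2702  # Stage 0 features + one Stage 2 prediction column

candidates = {
    'sqrt': int(math.sqrt(n_features)),
    'log2': int(math.log(n_features, 2)),
    '0.3': int(0.3 * n_features),
}

for name in ['sqrt', 'log2', '0.3']:
    print(name + ': ' + str(candidates[name]))
```

With `max_features = 0.3`, roughly 800 columns compete at each split instead of a few dozen, so an informative column is far more likely to be among the candidates.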

For the Lasso and Elastic Net models, I use the built-in `sklearn` classes `LassoCV` and `ElasticNetCV`, which perform 3-fold cross-validation over 100 different values of alpha. We fit them on the training data plus the predicted trajectory values from Stage 2 to predict the duration of the taxi ride.

In [None]:
lm_dur = LassoCV(n_jobs = -1, verbose = 3).\
fit(sX_train_traj, Y_train_dur)

In [None]:
enet_dur = ElasticNetCV(n_jobs = -1, verbose = 3)\
.fit(sX_train_traj, Y_train_dur)

We then build an ensemble model whose feature matrix has dimension (n_train, 4). The four features are the predictions of the four models above: XGBoost, Random Forest, Lasso, and Elastic Net. We again use Lasso to ensemble these four predictions, as we assume that the relationship between these four covariates and the target is linear. We perform 3-fold cross-validation with 100 unique values of alpha, using `LassoCV`.

In [None]:
X_ensemble_dur = np.zeros((n_train, 4))
X_ensemble_dur[:,0] = bst_dur.predict(dtrain_dur)
X_ensemble_dur[:,1] = rf_dur.predict(sX_train_traj)
X_ensemble_dur[:,2] = lm_dur.predict(sX_train_traj)
X_ensemble_dur[:,3] = enet_dur.predict(sX_train_traj)
lasso_dur = LassoCV(n_jobs=-1, verbose=1)\
.fit(X_ensemble_dur, Y_train_dur)
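
The stacking step can be illustrated on synthetic data. In this sketch, four noisy copies of a made-up target play the role of the base-model predictions, and ordinary least squares (`np.linalg.lstsq`) stands in for `LassoCV` to keep the example dependency-free; it is not the ensemble fitted above.

```python
import numpy as np

rng = np.random.RandomState(0)
n = 200
y = rng.normal(loc=6.0, scale=0.5, size=n)  # made-up target values

# Column k holds the predictions of hypothetical base model k.
noise_scales = [0.05, 0.10, 0.20, 0.20]
X_ens = np.zeros((n, 4))
for k, s in enumerate(noise_scales):
    X_ens[:, k] = y + rng.normal(scale=s, size=n)

# Linear meta-model: y ~ intercept + weighted sum of base predictions.
A = np.hstack([np.ones((n, 1)), X_ens])
coef = np.linalg.lstsq(A, y, rcond=None)[0]
blend = A.dot(coef)

def rmse(a, b):
    return np.sqrt(np.mean((a - b) ** 2))

base_rmse = [rmse(X_ens[:, k], y) for k in range(4)]
blend_rmse = rmse(blend, y)
print(blend_rmse <= min(base_rmse))
```

On the rows used for fitting, a linear blend with an intercept can never do worse than the best single column, so a fair comparison of the ensemble against its base models needs held-out rows.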

In [36]:
joblib.dump(bst_dur, 'bst_dur_stage3v3.pkl')
joblib.dump(rf_dur, 'rf_dur_stage3v3.pkl')
joblib.dump(lm_dur, 'lm_dur_stage3v3.pkl')
joblib.dump(enet_dur, 'enet_dur_stage3v3.pkl')
joblib.dump(lasso_dur, 'lasso_dur_stage3v3.pkl')

['lasso_dur_stage3v3.pkl']

## Traj

Unfortunately, I did not have time to run the grid search for the trajectory length model. I therefore assume that the optimal parameters of the XGBoost and RF models for trajectory length are similar to those for duration, and I reuse them to fit the trajectory length models.

As with duration, to predict the trajectory length I create an ensemble model (Lasso) that uses the results of 4 base models as its covariates:

- XGBoost
- Random Forest
- Elastic Net
- Lasso

The models are the same as those used to predict the duration values.

Each base model is fitted on the training data with the features engineered in Stage 0 (minus the outlier training rows identified in Stage 1) plus the Stage 2 predictions of duration.

In [55]:
param = { 'objective' : "reg:linear",
          'booster' : "gbtree",
          'eta' : 0.05,
          'max_depth' : 30,
          'colsample_bytree' : 0.7,
          'min_child_weight' : 3,
          'gamma' : 1,
          'subsample' : 0.7,
          'nthread' : 8  # XGBoost's thread-count key is `nthread`
        }

In [None]:
bst_traj = xgb.train(param, dtrain_traj, evals=[(dtrain_traj, 'train')], 
                num_boost_round = 2000, feval= rmpse, maximize = False)

In [None]:
rf_traj = RandomForestRegressor(max_depth = 12,
                                max_features = 0.3,
                                n_estimators=1000,
                                verbose = 10,
                                n_jobs = -1, 
                                criterion='mse', 
                                oob_score = True)\
.fit(sX_train_dur, Y_train_traj)

In [None]:
lm_traj = LassoCV(n_jobs = -1,
                  verbose = 3)\
.fit(sX_train_dur, Y_train_traj)

In [None]:
enet_traj = ElasticNetCV(n_jobs = -1, 
                         verbose = 3)\
.fit(sX_train_dur, Y_train_traj)

In [None]:
X_ensemble_traj = np.zeros((n_train, 4))
X_ensemble_traj[:,0] = bst_traj.predict(dtrain_traj)
X_ensemble_traj[:,1] = rf_traj.predict(sX_train_dur)
X_ensemble_traj[:,2] = lm_traj.predict(sX_train_dur)
X_ensemble_traj[:,3] = enet_traj.predict(sX_train_dur)
lasso_traj = LassoCV(n_jobs = -1, verbose =1 )\
.fit(X_ensemble_traj, Y_train_traj)

In [43]:
joblib.dump(bst_traj, 'bst_traj_stage3v3.pkl')
joblib.dump(rf_traj, 'rf_traj_stage3v3.pkl')
joblib.dump(lm_traj, 'lm_traj_stage3v3.pkl')
joblib.dump(enet_traj, 'enet_traj_stage3v3.pkl')
joblib.dump(lasso_traj, 'lasso_traj_stage3v3.pkl')

['lasso_traj_stage3v3.pkl']

In [20]:
Y_train_traj_pred = lasso_traj.predict(X_ensemble_traj)
joblib.dump(Y_train_traj_pred, 'Y_train_traj_pred_stage3v2.pkl')

['Y_train_traj_pred_stage3v2.pkl']

# Combine

After fitting all the models using our training data, namely:

- XGBoost for Duration
- Random Forest for Duration
- Lasso for Duration
- Elastic Net for Duration
- Lasso as the Ensemble model for Duration


- XGBoost for Trajlength
- Random Forest for Trajlength
- Lasso for TrajLength
- Elastic Net for Trajlength
- Lasso as the Ensemble model for Trajlength

We can then perform prediction for duration and trajlength on our test data. We feed the Stage 0 test data plus the Stage 2 predicted trajlength values into each of our duration base models, and then use the results as the input to our Lasso ensemble model for duration. This gives us the final predicted value for duration. Using the Stage 0 test data plus the Stage 2 predicted duration values, we perform the same steps for the trajlength models.

In [None]:
X_test_ens = np.zeros((n_test, 4))
X_test_ens[:,0] = bst_dur.predict(dtest_dur)
X_test_ens[:,1] = rf_dur.predict(sX_test_traj)
X_test_ens[:,2] = lm_dur.predict(sX_test_traj)
X_test_ens[:,3] = enet_dur.predict(sX_test_traj)
Y_test_dur_pred = lasso_dur.predict(X_test_ens)

In [None]:
X_test_ens = np.zeros((n_test, 4))
X_test_ens[:,0] = bst_traj.predict(dtest_traj)
X_test_ens[:,1] = rf_traj.predict(sX_test_dur)
X_test_ens[:,2] = lm_traj.predict(sX_test_dur)
X_test_ens[:,3] = enet_traj.predict(sX_test_dur)
Y_test_traj_pred = lasso_traj.predict(X_test_ens)
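
The two cells above follow the same pattern, so they could be folded into one helper. The sketch below is one possible refactoring, not the code that produced the submission; `stack_predict`, `ConstantModel` and `MeanModel` are hypothetical names, and the tiny stand-in models exist only to make the example runnable without the fitted estimators.

```python
import numpy as np

def stack_predict(base_models, base_inputs, meta_model):
    # One column per base model, each predicted from its own input
    # (e.g. a DMatrix for XGBoost, a sparse matrix for the others).
    columns = [model.predict(data)
               for model, data in zip(base_models, base_inputs)]
    X_ens = np.column_stack(columns)
    return meta_model.predict(X_ens)

class ConstantModel(object):
    # Hypothetical stand-in: predicts the same value for every row.
    def __init__(self, value):
        self.value = value
    def predict(self, X):
        return np.full(len(X), self.value)

class MeanModel(object):
    # Hypothetical stand-in for the meta-model: averages its columns.
    def predict(self, X):
        return X.mean(axis=1)

X_toy = np.zeros((3, 2))
bases = [ConstantModel(v) for v in [1.0, 2.0, 3.0, 6.0]]
y_hat = stack_predict(bases, [X_toy] * 4, MeanModel())
print(y_hat)
```

With the fitted objects from this notebook, a call might look like `stack_predict([bst_dur, rf_dur, lm_dur, enet_dur], [dtest_dur, sX_test_traj, sX_test_traj, sX_test_traj], lasso_dur)`, keeping the column order used when the ensemble was fitted.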

In [22]:
joblib.dump(Y_test_dur_pred, 'Y_test_dur_pred_stage3v3.pkl')
joblib.dump(Y_test_traj_pred, 'Y_test_traj_pred_stage3v3.pkl')

['Y_test_traj_pred_stage3v2.pkl']

Our predictions are on the log scale, so we take the exponential of the predicted values. Finally, we add the predicted duration and trajlength to get the predicted price. We can then submit our result to Kaggle! :)

In [25]:
Y_test_pri_pred = np.exp(Y_test_dur_pred) + np.exp(Y_test_traj_pred)
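
A tiny check of the back-transformation, with made-up numbers: if the models predict the logs of 200 and 100, the recovered price is 300. The exponential has to be applied to each component before adding; exponentiating the sum of the logs would give the product instead.

```python
import numpy as np

log_dur_pred = np.log(np.array([200.0, 150.0]))   # hypothetical log predictions
log_traj_pred = np.log(np.array([100.0, 50.0]))

price = np.exp(log_dur_pred) + np.exp(log_traj_pred)
wrong = np.exp(log_dur_pred + log_traj_pred)       # product, not sum

print(np.round(price, 6))
```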

In [26]:
test_id = pd.read_csv("test.csv").ID.values

In [27]:
data = {'ID': test_id,
       'PRICE': Y_test_pri_pred}
submission_df = pd.DataFrame(data = data)
submission_df.to_csv("stage_3_v3.csv", index=False)

In [28]:
submission_df

| | ID | PRICE |
|---|---|---|
| 0 | 465173 | 301.033741 |
| 1 | 465174 | 274.367411 |
| 2 | 465175 | 453.520390 |
| 3 | 465176 | 853.708869 |
| 4 | 465177 | 432.465489 |
| 5 | 465178 | 492.616076 |
| 6 | 465179 | 274.921103 |
| 7 | 465180 | 875.136083 |
| 8 | 465181 | 314.856632 |
| 9 | 465182 | 255.092013 |
