# Stage 2

In Stage 2, we predict duration and trajlength using the non-outlier training data: the features engineered in Stage 0, filtered with the non-outlier index found in Stage 1. We build a simple stacked ensemble in which a Random Forest and an XGBoost model are the base learners and a Lasso model combines their predictions. The models fitted on the training data are then used to predict both duration and trajlength for the test dataset.

In [8]:
# On scikit-learn >= 0.23, use `import joblib` instead
from sklearn.externals import joblib
from sklearn.model_selection import (train_test_split,
                                     GridSearchCV)
from sklearn.ensemble import (RandomForestRegressor,
                              GradientBoostingRegressor)
from sklearn.linear_model import LassoCV
import xgboost as xgb
from scipy import sparse
import numpy as np
import pandas as pd

We first load the datasets created in Stage 0.

In [15]:
X_train_stage0 = joblib.load( 'X_train_stage0.pkl')
X_test_stage0 = joblib.load( 'X_test_stage0.pkl')
Y_train_price = joblib.load( 'Y_train_price.pkl')
Y_train_duration = joblib.load('Y_train_duration.pkl')
Y_train_trajlength = joblib.load('Y_train_trajlength.pkl')

In [16]:
print X_train_stage0.shape
print X_test_stage0.shape
print Y_train_price.shape
print Y_train_duration.shape
print Y_train_trajlength.shape

(465172, 1561)
(465172, 1561)
(465172,)
(465172,)
(465172,)


We then use the non-outlier index found in Stage 1 to remove the outliers from the training data. As the output below shows, the index keeps 464,151 of the 465,172 rows, so 1,021 training rows are removed.

In [17]:
non_outlier_index_stage1 = joblib.load('non_outlier_index_stage1.pkl')
print non_outlier_index_stage1.shape
print non_outlier_index_stage1

(464151,)
[     0      1      2 ... 465169 465170 465171]


In [18]:
X_train_stage1 = X_train_stage0[non_outlier_index_stage1]
X_test_stage1 = X_test_stage0
Y_train_price = Y_train_price[non_outlier_index_stage1]
Y_train_duration  = Y_train_duration[non_outlier_index_stage1]
Y_train_trajlength = Y_train_trajlength[non_outlier_index_stage1]
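
The filtering step above relies on integer-array indexing: the same index is applied to the feature matrix and to every target so that rows stay aligned. A minimal sketch on toy data follows; the arrays and the `keep_idx` name are hypothetical stand-ins, not the Stage 0 or Stage 1 objects.

```python
import numpy as np

# Toy stand-ins for the feature matrix and one target (hypothetical values).
X = np.arange(12).reshape(6, 2)
y = np.array([10.0, 11.0, 500.0, 12.0, 13.0, 14.0])

# Hypothetical index of rows that were NOT flagged as outliers (row 2 is dropped).
keep_idx = np.array([0, 1, 3, 4, 5])

# Apply the same index to features and target so the rows stay aligned.
X_clean = X[keep_idx]
y_clean = y[keep_idx]
n_removed = X.shape[0] - X_clean.shape[0]

print(X_clean.shape, y_clean.shape, n_removed)
```

The test matrix is left untouched, since outlier removal only applies to rows with known targets.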

In [19]:
n_train = X_train_stage1.shape[0]
n_test = X_test_stage1.shape[0]

In [20]:
print X_train_stage1.shape
print X_test_stage1.shape
print Y_train_duration.shape
print Y_train_trajlength.shape
print Y_train_price.shape

(464151, 1561)
(465172, 1561)
(464151,)
(464151,)
(464151,)


To find appropriate hyperparameters, we split the training data 80%-20%. We try different hyperparameter values for the Random Forest and XGBoost models, fit them on the 80% portion, and check the RMSPE on the remaining 20%. To save space in the report, I only show the code for the validation process.

In [34]:
idx_train, idx_val = train_test_split\
(np.arange(n_train), test_size = 0.20)
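
Splitting an array of row indices, rather than the matrices themselves, lets the same split be reused for the features and for every target. The sketch below reproduces the split sizes on the row count reported above. The `random_state` argument is an assumption added here for reproducibility; the original call does not fix a seed, so its split changes between runs.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 464151  # number of non-outlier training rows reported above

# Split row indices 80/20; random_state is an added assumption.
idx_train, idx_val = train_test_split(np.arange(n_rows),
                                      test_size=0.20,
                                      random_state=0)

print(len(idx_train), len(idx_val))
```

The resulting sizes match the shapes printed later in this stage (371320 training rows and 92831 validation rows).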

In [26]:
X_test = X_test_stage1

X_full_train = X_train_stage1
Y_full_train_dur = Y_train_duration
Y_full_train_traj = Y_train_trajlength

sX_full_train = sparse.csc_matrix(X_full_train)
sX_test = sparse.csc_matrix(X_test)

dtest = xgb.DMatrix(sX_test)
dtrain_full_dur = xgb.DMatrix(sX_full_train, 
                              label= Y_full_train_dur)
dtrain_full_traj = xgb.DMatrix(sX_full_train,
                               label= Y_full_train_traj)

In [33]:
X_train, X_val = \
X_train_stage1[idx_train], X_train_stage1[idx_val]
Y_train_dur, Y_val_dur =\
Y_train_duration[idx_train], Y_train_duration[idx_val]
Y_train_traj, Y_val_traj =\
Y_train_trajlength[idx_train], Y_train_trajlength[idx_val]
Y_train_pri, Y_val_pri =\
Y_train_price[idx_train], Y_train_price[idx_val]

In [40]:
print X_full_train.shape
print Y_full_train_dur.shape
print Y_full_train_traj.shape
print X_train.shape
print X_val.shape
print Y_train_dur.shape
print Y_val_dur.shape
print Y_train_traj.shape
print Y_val_traj.shape
print X_test.shape

(464151, 1561)
(464151,)
(464151,)
(371320, 1561)
(92831, 1561)
(371320,)
(92831,)
(371320,)
(92831,)
(465172, 1561)


Because we performed one-hot encoding, the matrix is large and sparse. Converting it from a dense NumPy array to a SciPy sparse matrix reduces memory use and speeds up training.

In [36]:
sX_train = sparse.csc_matrix(X_train)
sX_val = sparse.csc_matrix(X_val)
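
To illustrate why the sparse format helps, here is a small sketch that one-hot encodes a hypothetical categorical column and compares the storage needed by the dense array and by its CSC form. The sizes and names are made up for the illustration and are much smaller than the real matrix.

```python
import numpy as np
from scipy import sparse

rng = np.random.RandomState(0)
n_rows, n_levels = 2000, 500  # hypothetical sizes

# One-hot encode a random categorical column: exactly one 1 per row.
codes = rng.randint(0, n_levels, size=n_rows)
dense = np.zeros((n_rows, n_levels))
dense[np.arange(n_rows), codes] = 1.0

# CSC stores only the non-zero values plus their row indices and column pointers.
csc = sparse.csc_matrix(dense)
dense_bytes = dense.nbytes
sparse_bytes = csc.data.nbytes + csc.indices.nbytes + csc.indptr.nbytes

print(dense_bytes, sparse_bytes, csc.nnz)
```

The conversion is lossless: `csc.toarray()` gives back the original dense array.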

In [37]:
dtrain_dur = xgb.DMatrix(sX_train, label= Y_train_dur)
dval_dur = xgb.DMatrix(sX_val, label=Y_val_dur)
dtrain_traj = xgb.DMatrix(sX_train, label= Y_train_traj)
dval_traj = xgb.DMatrix(sX_val, label =Y_val_traj)

As in Stage 1, we build a custom function so that we can track the RMSPE loss while fitting the Random Forest and XGBoost models.

In [27]:
from sklearn.metrics import make_scorer

def rmpse_loss_func(ground_truth, predictions):
    err = np.sqrt\
    (np.mean((np.true_divide\
              (predictions, ground_truth) - 1.)**2))
    return err

rmpse_loss  = make_scorer(rmpse_loss_func, 
                          greater_is_better=False)

In [28]:
def rmpse(preds, dtrain):
    labels = dtrain.get_label()
    err = np.sqrt(np.mean((np.true_divide\
                           (preds, labels) - 1.)**2))
    return 'error', err
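
Both functions above compute the root mean squared percentage error, that is, the square root of the mean of ((prediction / truth) - 1)^2. A small worked example with made-up numbers shows the arithmetic; the `rmspe` helper below is a standalone re-statement for illustration, not a replacement for the functions above.

```python
import numpy as np

def rmspe(y_true, y_pred):
    """Root mean squared percentage error of y_pred relative to y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_pred / y_true - 1.0) ** 2))

# Hypothetical values: relative errors are +10%, -10% and 0%.
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 180.0, 400.0])

# sqrt((0.01 + 0.01 + 0.0) / 3)
score = rmspe(y_true, y_pred)
print(score)
```

Note that a scorer built with `make_scorer(..., greater_is_better=False)` returns the negated loss, so values reported through `rmpse_loss` are negative, whereas `rmpse_loss_func` and `rmpse` return the positive error.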

In [38]:
watchlist_dur = [(dval_dur, 'eval_dur'),
                 (dtrain_dur, 'train_dur')]
watchlist_traj = [(dval_traj, 'eval_traj'),
                  (dtrain_traj, 'train_traj')]

## Predicting Duration

The final parameters used for XGBoost in Stage 2 are as follows.

In [52]:
param = { 'objective' : "reg:linear",
          'booster' : "gbtree",
          'eta' : 0.05,
          'max_depth' : 12,
          'colsample_bytree' : 0.7,
          'subsample' : 0.7,
          'gamma' : 1,
          'nthread' : 8  # XGBoost's thread-count key is `nthread`, not `n_thread`
        }

In [None]:
bst_dur = xgb.train(param, dtrain_full_dur, 
                    evals=[(dtrain_full_dur,
                            'train')], 
                num_boost_round = 2000, 
                    feval= rmpse, maximize = False)

The parameters for the Random Forest model are as follows. In hindsight, I should have chosen 0.3 for `max_features` instead of `sqrt`, as `sqrt` considers too few features at each node. Furthermore, even though in practice the maximum depth of a Random Forest is usually left unlimited, I limit it here to save computation time.

In [None]:
rf_dur = RandomForestRegressor(max_depth = 22, 
                               max_features = 'sqrt',
                               n_estimators=2000, 
                               verbose = 10, n_jobs = -1,
                               criterion='mse')\
.fit(sX_full_train, Y_full_train_dur)
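
To put the `max_features` remark in numbers, the sketch below computes how many of the 1561 columns are considered at each split under the two settings. It assumes that `'sqrt'` resolves to the integer part of the square root of the feature count and that a float resolves to the integer part of the fraction times the feature count, which is how scikit-learn has handled these options as far as I know.

```python
import math

n_features = 1561  # number of columns in the training matrix

# Assumption: 'sqrt' -> int(sqrt(n_features)), float -> int(fraction * n_features).
per_split_sqrt = max(1, int(math.sqrt(n_features)))
per_split_frac = max(1, int(0.3 * n_features))

print(per_split_sqrt, per_split_frac)
```

Under these assumptions `'sqrt'` examines 39 candidate features per split, against 468 for a fraction of 0.3, which is a large gap on a matrix dominated by one-hot columns.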

We then use the predictions from the XGBoost and Random Forest models as the inputs to the Lasso model. Here we fit a cross-validated Lasso (`LassoCV`), which by default tries 100 different values of alpha and keeps the one with the lowest cross-validated mean squared error.

In [None]:
X_ensemble_dur = np.zeros((n_train, 2))
X_ensemble_dur[:,0] = bst_dur.predict(dtrain_full_dur)
X_ensemble_dur[:,1] = rf_dur.predict(sX_full_train)
lasso_dur = LassoCV().fit(X_ensemble_dur, Y_full_train_dur)
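
The stacking step can be sketched on synthetic data as follows. The two noisy columns are hypothetical stand-ins for the XGBoost and Random Forest predictions, and `cv=5` is set explicitly because the default number of folds has changed between scikit-learn versions; neither detail comes from the original run.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.RandomState(0)
n = 500
y = rng.uniform(5.0, 8.0, size=n)  # synthetic target

# Hypothetical base-model predictions: model A is more accurate than model B.
pred_a = y + rng.normal(0.0, 0.10, size=n)
pred_b = y + rng.normal(0.0, 0.30, size=n)

# One column per base model, exactly like X_ensemble_dur above.
X_ens = np.column_stack([pred_a, pred_b])
meta = LassoCV(cv=5).fit(X_ens, y)

print(len(meta.alphas_))   # size of the alpha grid that was searched
print(meta.alpha_)         # alpha chosen by cross-validation
print(meta.coef_, meta.intercept_)
```

The fitted coefficients act as blending weights, and in this toy setup the more accurate base model receives the larger weight.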

In [56]:
joblib.dump(bst_dur, 'bst_dur_stage2.pkl')
joblib.dump(rf_dur, 'rf_dur_stage2.pkl')
joblib.dump(lasso_dur, 'lasso_dur_stage2.pkl')

['lasso_dur_stage2.pkl']

## Predicting Trajlength

The setup for predicting trajlength is very similar. Compared with the duration model, the only difference is that we fit the RF and XGB models against the trajlength training values instead of the duration training values.

In [57]:
param = { 'objective' : "reg:linear",
          'booster' : "gbtree",
          'eta' : 0.05,
          'max_depth' : 12,
          'colsample_bytree' : 0.7,
          'subsample' : 0.7,
          'gamma' : 1,
          'nthread' : 8  # XGBoost's thread-count key is `nthread`, not `n_thread`
        }

In [None]:
bst_traj = xgb.train(param, dtrain_full_traj,
                     evals=[(dtrain_full_traj, 'train')], 
                num_boost_round = 2000,
                     feval= rmpse, maximize = False)

In [None]:
rf_traj = \
RandomForestRegressor(max_depth = 22,
                      max_features = 'sqrt',
                      n_estimators=2000, verbose = 3,
                      n_jobs = -1, criterion='mse'\
                     ).fit(sX_full_train, Y_full_train_traj)

In [None]:
X_ensemble_traj = np.zeros((n_train, 2))
X_ensemble_traj[:,0] = bst_traj.predict(dtrain_full_traj)
X_ensemble_traj[:,1] = rf_traj.predict(sX_full_train)
lasso_traj = LassoCV().fit(X_ensemble_traj, Y_full_train_traj)

In [61]:
joblib.dump(bst_traj, 'bst_traj_stage2.pkl')
joblib.dump(rf_traj, 'rf_traj_stage2.pkl')
joblib.dump(lasso_traj, 'lasso_traj_stage2.pkl')

['lasso_traj_stage2.pkl']

## Hyperparameter Tuning

In this section, I show the hyperparameter tuning process used to arrive at the parameter values in the RF and XGB models for both duration and trajlength. Performing grid search or randomized search with cross-validation is not a good idea in this case because both the number of covariates and the number of observations are extremely large. Therefore, I increased the complexity of the model by intuition, adjusting the hyperparameters gradually whenever the model showed high bias or started to overfit.
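
A minimal sketch of this kind of manual tuning loop is shown below, on small synthetic data. It increases one complexity parameter step by step and compares the training and validation RMSPE at each step. The data, the depth values, and the `rmspe` helper are all hypothetical and far smaller than the real problem.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def rmspe(y_true, y_pred):
    return np.sqrt(np.mean((y_pred / y_true - 1.0) ** 2))

# Synthetic, strictly positive target with a non-linear signal.
rng = np.random.RandomState(0)
X = rng.uniform(0.0, 1.0, size=(600, 5))
y = (10.0 + 5.0 * np.sin(6.0 * X[:, 0]) + 3.0 * X[:, 1]
     + rng.normal(0.0, 0.5, size=600))

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.20,
                                          random_state=0)

# Raise model complexity gradually and record both errors.
results = []
for depth in (2, 6, 12):
    rf = RandomForestRegressor(n_estimators=50, max_depth=depth,
                               random_state=0, n_jobs=1)
    rf.fit(X_tr, y_tr)
    results.append((depth,
                    rmspe(y_tr, rf.predict(X_tr)),
                    rmspe(y_va, rf.predict(X_va))))

for depth, tr_err, va_err in results:
    print(depth, round(tr_err, 4), round(va_err, 4))
```

High error on both sets suggests bias, so complexity is raised. A training error that keeps falling while the validation error stalls or rises suggests overfitting, so complexity is held or reduced.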

In [1]:
# bst_dur = xgb.train(param, dtrain_dur, evals=[(dtrain_dur, 'train')],
#                     num_boost_round = 2000, feval= rmpse, maximize = False)
# rf_dur = RandomForestRegressor(max_depth = 22, max_features = 'sqrt', n_estimators=2000,
#                                verbose = 3, n_jobs = -1, criterion='mse').fit(sX_train, Y_train_dur)

In [2]:
# X_train_ens = np.zeros((sX_train.shape[0], 2))  # row count taken from the split, not hard-coded
# X_train_ens[:,0] = bst_dur.predict(dtrain_dur)
# X_train_ens[:,1] = rf_dur.predict(sX_train)
# X_val_ens = np.zeros((sX_val.shape[0], 2))
# X_val_ens[:,0] = bst_dur.predict(dval_dur)
# X_val_ens[:,1] = rf_dur.predict(sX_val)
# print rmpse_loss(lasso_dur, X_train_ens, Y_train_dur)
# print rmpse_loss(lasso_dur, X_val_ens, Y_val_dur)

In [3]:
#bst_traj = xgb.train(param, dtrain_traj, evals=[(dtrain_traj, 'train')], 
#                num_boost_round = 2000, feval= rmpse, maximize = False)
#rf_traj = RandomForestRegressor(max_depth = 22, max_features = 'sqrt', n_estimators=2000, 
 #                               verbose = 3, n_jobs = -1, criterion='mse').fit(sX_train, Y_train_traj)

In [4]:
# X_train_ens = np.zeros((sX_train.shape[0], 2))  # row count taken from the split, not hard-coded
# X_train_ens[:,0] = bst_traj.predict(dtrain_traj)
# X_train_ens[:,1] = rf_traj.predict(sX_train)
# X_val_ens = np.zeros((sX_val.shape[0], 2))
# X_val_ens[:,0] = bst_traj.predict(dval_traj)
# X_val_ens[:,1] = rf_traj.predict(sX_val)
# print rmpse_loss(lasso_traj, X_train_ens, Y_train_traj)
# print rmpse_loss(lasso_traj, X_val_ens, Y_val_traj)

In [5]:
# X_val_ens = np.zeros((sX_val.shape[0], 2))
# X_val_ens[:,0] = bst_dur.predict(dval_dur)
# X_val_ens[:,1] = rf_dur.predict(sX_val)
# Y_val_dur_pred = lasso_dur.predict(X_val_ens)

In [6]:
# X_val_ens = np.zeros((sX_val.shape[0], 2))
# X_val_ens[:,0] = bst_traj.predict(dval_traj)
# X_val_ens[:,1] = rf_traj.predict(sX_val)
# Y_val_traj_pred = lasso_traj.predict(X_val_ens)

In [7]:
# Y_val_pri_pred = np.exp(Y_val_dur_pred) + np.exp(Y_val_traj_pred)

In [8]:
# Y_val_pri_pred

In [9]:
# np.exp(Y_val_pri)

In [107]:
# rmpse_loss_func(np.exp(Y_val_pri), Y_val_pri_pred)

0.20196613102493846

## Combine

Lastly, we apply the models fitted on the training data to the test dataset to get the Stage 2 predictions for both duration and trajlength. We then save these predictions for the third stage.

In [None]:
X_test_ens = np.zeros((n_test, 2))
X_test_ens[:,0] = bst_dur.predict(dtest)
X_test_ens[:,1] = rf_dur.predict(sX_test)
Y_test_dur_pred = lasso_dur.predict(X_test_ens)

In [None]:
X_test_ens = np.zeros((n_test, 2))
X_test_ens[:,0] = bst_traj.predict(dtest)
X_test_ens[:,1] = rf_traj.predict(sX_test)
Y_test_traj_pred = lasso_traj.predict(X_test_ens)

In [64]:
Y_test_pri_pred = np.exp(Y_test_dur_pred) \
+ np.exp(Y_test_traj_pred)
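
The cell above combines the two predictions into a price. A small sketch of that arithmetic follows, assuming (as the use of `np.exp` suggests) that both targets were modelled on a log scale and that price is the sum of duration and trajlength on the original scale. The numbers and variable names are made up for the illustration.

```python
import numpy as np

# Hypothetical log-scale predictions for three trips.
log_dur_pred = np.log(np.array([300.0, 600.0, 900.0]))
log_traj_pred = np.log(np.array([1500.0, 2500.0, 4000.0]))

# Undo each log transform separately, then add on the original scale.
price_pred = np.exp(log_dur_pred) + np.exp(log_traj_pred)

print(price_pred)
```

Each prediction is exponentiated before the sum, because `exp(a) + exp(b)` is not the same as `exp(a + b)`.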

In [65]:
test_id = pd.read_csv("test.csv").ID.values

In [66]:
data = {'ID': test_id,
       'PRICE': Y_test_pri_pred}
submission_df = pd.DataFrame(data = data)
submission_df.to_csv("stage_2_v1.csv", index=False)

In [68]:
joblib.dump(Y_test_dur_pred, 'Y_dur_stage2.pkl')
joblib.dump(Y_test_traj_pred, 'Y_traj_stage2.pkl')

['Y_traj_stage2.pkl']