## Train and dev partition for model selection

We split the data into train and dev sets and set the test set aside while we select the best model.
We evaluate six models. Some of them belong to the same family and give similar results, but we still need to compare them to choose the best one.

In this chapter, we choose the best model for predicting the number of visitors.

In [1]:
# Load libraries
import numpy as np 
import pandas as pd 
from subprocess import check_output
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
import glob, re
from sklearn import *
from datetime import datetime
from xgboost import XGBRegressor

In [3]:
np.random.seed(10)

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error

### 1. Data import

In [5]:
# Load the aggregated training data
train = pd.read_csv('train.csv')

In [6]:
train.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,visit_date,visitors,air_store_id,latitude,longitude,month,date,dw,...,Ōsaka-fu,Hyōgo-ken,Hokkaidō,Shizuoka-ken,Fukuoka-ken,Hiroshima-ken,Niigata-ken,Miyagi-ken,reserve_visitors_air_1,air_date_diff_1
0,0,0,2016-01-13,25,air_ba937bf13d40fb24,35.658068,139.751599,1,13,2,...,0,0,0,0,0,0,0,0,,
1,1,1,2016-01-13,21,air_25e9888d30b386df,35.626568,139.725858,1,13,2,...,0,0,0,0,0,0,0,0,,
2,2,2,2016-01-13,40,air_fd6aac1043520e83,35.658068,139.751599,1,13,2,...,0,0,0,0,0,0,0,0,,
3,3,3,2016-01-13,5,air_64d4491ad8cdb1c6,35.658068,139.751599,1,13,2,...,0,0,0,0,0,0,0,0,,
4,4,4,2016-01-13,16,air_5c65468938c07fa5,35.661777,139.704051,1,13,2,...,0,0,0,0,0,0,0,0,,


In [7]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 247009 entries, 0 to 247008
Data columns (total 60 columns):
Unnamed: 0                      247009 non-null int64
Unnamed: 0.1                    247009 non-null int64
visit_date                      247009 non-null object
visitors                        247009 non-null int64
air_store_id                    247009 non-null object
latitude                        247009 non-null float64
longitude                       247009 non-null float64
month                           247009 non-null int64
date                            247009 non-null int64
dw                              247009 non-null int64
dy                              247009 non-null int64
holiday_flg                     247009 non-null int64
sunday                          247009 non-null int64
saturday                        247009 non-null int64
sat/sun/hol                     247009 non-null float64
precipitation                   247009 non-null float64
avg_temperature

In [10]:
train = train.drop(['Unnamed: 0' , 'Unnamed: 0.1'], axis=1)

### 2. Train/dev partition

In [11]:
col = [c for c in train if c not in [ 'air_store_id', 'visit_date','visitors']]

In [12]:
train = train.fillna(-1)

In [13]:
for c, dtype in zip(train.columns, train.dtypes):
    if dtype == np.float64:
        train[c] = train[c].astype(np.float32)

#### To test and select models, we divide the training data set into two parts.
Just as the train and test sets are separated by date, we split off the dev set by date, holding out a time interval of equal length.

In [14]:
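X = train[col]
# Split by date; >= keeps the rows dated 2017-03-01 in the dev set instead of dropping them
X_train = train[train.visit_date < '2017-03-01'][col]
X_dev = train[train.visit_date >= '2017-03-01'][col]

# The target is log1p(visitors), so RMSE on it equals RMSLE on the raw counts
y_train = np.log1p(train[train.visit_date < '2017-03-01']['visitors'].values)
y_dev = np.log1p(train[train.visit_date >= '2017-03-01']['visitors'].values)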
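
A date-based split of this kind can be sketched as a small helper. This is only an illustration: the name `split_by_date` and the toy rows are assumptions, not part of the notebook. The helper uses `>=` for the second part so that every row, including those dated exactly on the cutoff, lands in exactly one part.

```python
import pandas as pd

def split_by_date(df, cutoff, date_col="visit_date"):
    """Split df into (rows before cutoff, rows on or after cutoff)."""
    dates = pd.to_datetime(df[date_col])
    is_dev = dates >= pd.Timestamp(cutoff)
    return df[~is_dev], df[is_dev]

# Toy frame with made-up rows, only to show the boundary behaviour
toy = pd.DataFrame({
    "visit_date": ["2017-02-27", "2017-02-28", "2017-03-01", "2017-03-02"],
    "visitors": [25, 21, 40, 5],
})
train_part, dev_part = split_by_date(toy, "2017-03-01")

# No row is lost at the cutoff date
assert len(train_part) + len(dev_part) == len(toy)
```

Parsing the dates with `pd.to_datetime` is a design choice; comparing ISO-formatted `YYYY-MM-DD` strings directly, as the notebook does, gives the same ordering.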

In [15]:
def RMSLE(y, pred):
    # y and pred are already on the log1p scale, so this RMSE is the RMSLE of the raw counts
    return metrics.mean_squared_error(y, pred)**0.5
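
The function is named `RMSLE` although it computes a plain RMSE, because the targets were already passed through `np.log1p`. The sketch below checks that equivalence with NumPy only. The `visitors` values are the five shown in `train.head()`; the `predicted` values are made up for illustration.

```python
import numpy as np

def rmse(y, pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((np.asarray(y) - np.asarray(pred)) ** 2)))

def rmsle(y_raw, pred_raw):
    """Root mean squared logarithmic error on raw (untransformed) counts."""
    return float(np.sqrt(np.mean((np.log1p(pred_raw) - np.log1p(y_raw)) ** 2)))

visitors = np.array([25.0, 21.0, 40.0, 5.0, 16.0])   # from train.head()
predicted = np.array([20.0, 25.0, 35.0, 8.0, 16.0])  # made-up predictions

direct = rmsle(visitors, predicted)
via_log = rmse(np.log1p(visitors), np.log1p(predicted))
assert np.isclose(direct, via_log)

# Predictions made on the log1p scale map back to visitor counts with expm1
restored = np.expm1(np.log1p(visitors))
assert np.allclose(restored, visitors)
```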

In [16]:
model1 = ensemble.GradientBoostingRegressor(learning_rate=0.2, random_state=3, n_estimators=200, subsample=0.8,
                      max_depth=10)

In [17]:
model2 = neighbors.KNeighborsRegressor(n_jobs=-1, n_neighbors=4)

In [18]:
model3 = XGBRegressor(learning_rate=0.2, n_estimators=280, subsample=0.8,
                      colsample_bytree=0.8, max_depth=12)

In [20]:
from sklearn.ensemble import RandomForestRegressor

In [21]:
model4 = RandomForestRegressor(max_depth=2, random_state=0)

In [22]:
model5 = ensemble.ExtraTreesRegressor(n_estimators=225, max_depth=5, n_jobs=-1, random_state=3)

In [23]:
model6 = linear_model.LinearRegression(n_jobs=-1)

In [25]:
model1.fit(X_train, y_train)

GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.2, loss='ls', max_depth=10, max_features=None,
             max_leaf_nodes=None, min_impurity_decrease=0.0,
             min_impurity_split=None, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             n_estimators=200, n_iter_no_change=None, presort='auto',
             random_state=3, subsample=0.8, tol=0.0001,
             validation_fraction=0.1, verbose=0, warm_start=False)

In [26]:
model2.fit(X_train, y_train)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=-1, n_neighbors=4, p=2,
          weights='uniform')

In [27]:
model3.fit(X_train, y_train)

XGBRegressor(base_score=0.5, colsample_bylevel=1, colsample_bytree=0.8,
       gamma=0, learning_rate=0.2, max_delta_step=0, max_depth=12,
       min_child_weight=1, missing=None, n_estimators=280, nthread=-1,
       objective='reg:linear', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True, subsample=0.8)

In [29]:
model4.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=2,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
           oob_score=False, random_state=0, verbose=0, warm_start=False)

In [30]:
model5.fit(X_train, y_train)

ExtraTreesRegressor(bootstrap=False, criterion='mse', max_depth=5,
          max_features='auto', max_leaf_nodes=None,
          min_impurity_decrease=0.0, min_impurity_split=None,
          min_samples_leaf=1, min_samples_split=2,
          min_weight_fraction_leaf=0.0, n_estimators=225, n_jobs=-1,
          oob_score=False, random_state=3, verbose=0, warm_start=False)

In [31]:
model6.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=-1, normalize=False)

In [32]:
preds1 = model1.predict(X_train)

In [33]:
preds2 = model2.predict(X_train)

In [34]:
preds3 = model3.predict(X_train)

In [35]:
preds4 = model4.predict(X_train)

In [36]:
preds5 = model5.predict(X_train)

In [37]:
preds6 = model6.predict(X_train)

In [39]:
print('RMSE GradientBoostingRegressor: ', RMSLE(y_train, preds1))

('RMSE GradientBoostingRegressor: ', 0.5999133972950422)


In [41]:
print('RMSE KNeighborsRegressor: ', RMSLE(y_train, preds2))

('RMSE KNeighborsRegressor: ', 0.6653271669113268)


In [44]:
print('RMSE XGBRegressor: ', RMSLE(y_train, preds3))

('RMSE XGBRegressor: ', 0.5703191647257186)


In [46]:
print('RMSE RandomForestRegressor: ', RMSLE(y_train, preds4))

('RMSE RandomForestRegressor: ', 0.7745519708809019)


In [47]:
print('RMSE ExtraTreesRegressor: ', RMSLE(y_train, preds5))

('RMSE ExtraTreesRegressor: ', 0.7622688887189537)


In [48]:
print('RMSE LinearRegression: ', RMSLE(y_train, preds6))

('RMSE LinearRegression: ', 0.7581474557693982)


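Among the six models, XGBRegressor gives the smallest error in this section: 0.570319. Note that these errors are computed on `X_train`, the data the models were fitted on, so the ranking still has to be confirmed by scoring the same models on `X_dev`.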
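
One way to rank fitted models on held-out data is sketched below. It is a minimal illustration, not the notebook's own code: `score_models` and `ConstantModel` are hypothetical names, and the toy data is made up. The helper accepts any object with a `predict` method, which includes the six fitted estimators above.

```python
import numpy as np

def rmse(y, pred):
    """Root mean squared error (equals RMSLE when y is on the log1p scale)."""
    return float(np.sqrt(np.mean((np.asarray(y) - np.asarray(pred)) ** 2)))

def score_models(models, X, y):
    """Return (name, error) pairs sorted from lowest to highest error."""
    scores = [(name, rmse(y, model.predict(X))) for name, model in models.items()]
    return sorted(scores, key=lambda pair: pair[1])

class ConstantModel:
    """Hypothetical stand-in for a fitted estimator: always predicts one value."""
    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return np.full(len(X), self.value)

# Toy held-out data (made up); the features are ignored by the stand-in models
X_toy = np.zeros((4, 2))
y_toy = np.log1p(np.array([25.0, 21.0, 40.0, 5.0]))

ranking = score_models(
    {"constant_mean": ConstantModel(y_toy.mean()),
     "constant_zero": ConstantModel(0.0)},
    X_toy, y_toy,
)
best_name, best_error = ranking[0]
```

In the notebook, the same helper could be called as `score_models({'GradientBoostingRegressor': model1, 'XGBRegressor': model3}, X_dev, y_dev)`, extended with the remaining models, to obtain the dev-set ranking.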