# Rossmann Sales Prediction

Task: Predict 6 weeks of daily sales for 1,115 stores located across Germany.  
Data: https://www.kaggle.com/c/rossmann-store-sales

Situation: Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers 
    are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by 
    many factors, including promotions, competition, school and state holidays, seasonality, and locality. 

Desired Outcome: Reliable sales forecasts enable store managers to create effective staff schedules that increase productivity and motivation. By helping Rossmann create a robust prediction model, you will help store managers stay focused on what’s most important to them: their customers and their teams! 

Action: Build several supervised predictive models and compare them to find the one with the lowest RMSE.

### Approach
1) Load relevant libraries
2) Import data
3) Data cleaning
4) Data transformation
5) Feature Engineering
6) Predictive Models
7) Model Selection
8) Hyper-parameter Tuning
9) Evaluation

### Future improvements

1) RandomizedSearchCV - Instead of relying on grid search alone, we can first use random search to locate a promising region of the hyperparameter space, and then run grid search within that region to find the best setting.
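
A minimal sketch of this two-stage search, assuming a synthetic dataset from `make_regression` and a `DecisionTreeRegressor` as stand-ins. The parameter ranges and the names `coarse_space`, `fine_grid`, `coarse` and `fine` are illustrative and not taken from this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: not the Rossmann data)
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Stage 1: random search samples a few points from a wide range
coarse_space = {'max_depth': list(range(2, 21)),
                'min_samples_leaf': list(range(1, 31))}
coarse = RandomizedSearchCV(DecisionTreeRegressor(random_state=0), coarse_space,
                            n_iter=15, cv=3, scoring='neg_mean_squared_error',
                            random_state=0)
coarse.fit(X, y)
best_depth = coarse.best_params_['max_depth']
best_leaf = coarse.best_params_['min_samples_leaf']

# Stage 2: exhaustive grid search in a narrow window around the stage-1 optimum
fine_grid = {'max_depth': [d for d in (best_depth - 1, best_depth, best_depth + 1) if d >= 1],
             'min_samples_leaf': [m for m in (best_leaf - 1, best_leaf, best_leaf + 1) if m >= 1]}
fine = GridSearchCV(DecisionTreeRegressor(random_state=0), fine_grid, cv=3,
                    scoring='neg_mean_squared_error')
fine.fit(X, y)

print('coarse best:', coarse.best_params_, coarse.best_score_)
print('fine best:  ', fine.best_params_, fine.best_score_)
```

Because the fine grid contains the point found by the random search, the second stage can only match or improve on the first-stage score under the same folds.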

### Modelling

##### Import relevant libraries

In [None]:
import math

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor
import sklearn.ensemble as es
from sklearn.feature_selection import SelectKBest, f_classif, f_regression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, cross_val_score, KFold
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error, explained_variance_score

import xgboost as xgb
from xgboost import XGBRegressor

from keras.models import Sequential
from keras.layers import Dense
# Older Keras API: this wrapper module is no longer shipped with recent Keras releases
from keras.wrappers.scikit_learn import KerasRegressor

### Import Data

In [None]:
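# StateHoliday holds 0 as well as the letters a/b/c; reading it as text avoids a mixed-type column
train_input = pd.read_csv('train.csv', parse_dates=['Date'], dtype={'StateHoliday': str})
test_input  = pd.read_csv('test.csv', parse_dates=['Date'], dtype={'StateHoliday': str})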
store = pd.read_csv('store.csv')

##### Check dimensions of the input dataset


In [None]:
print('test_input ', test_input.shape) #test  (41088, 8)
print('train_input ',train_input.shape) #train  (1017209, 9)

### Data cleaning and Data transformation

##### Checking for Nulls in the dataset

In [None]:
print('Store variable null counts:\n',store.isnull().sum()) 

##### There are many nulls in the store dataset that need to be treated. We replace them with zeros.

In [None]:
store.fillna(0,inplace = True)

##### Merge store level data with the test and train data


In [None]:
train_df = pd.merge(train_input,store,on='Store')
test_df = pd.merge(test_input,store,on='Store')

##### Check dimensions of the merged dataset


In [None]:
print('test ', test_df.shape) #test  (41088, 17)
print('train ',train_df.shape) #train  (1017209, 18)
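
The unchanged row counts suggest that no rows were lost in the merge. `pd.merge` defaults to an inner join, which silently drops rows whose `Store` has no match in the store table. The sketch below uses toy frames (`sales` and `stores` are made-up stand-ins) to show one way of making that check explicit with the `how` and `validate` arguments.

```python
import pandas as pd

# Toy stand-ins for the daily sales rows and the one-row-per-store metadata
sales = pd.DataFrame({'Store': [1, 1, 2, 3], 'Sales': [5263, 6064, 8314, 4822]})
stores = pd.DataFrame({'Store': [1, 2], 'StoreType': ['c', 'a']})

# how='left' keeps every sales row even when a store has no metadata;
# validate='many_to_one' raises if `stores` contains a duplicated Store key
merged = pd.merge(sales, stores, on='Store', how='left', validate='many_to_one')

print(merged.shape)                        # same number of rows as `sales`
print(merged['StoreType'].isnull().sum())  # rows whose store had no metadata
```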

##### Checking for nulls in the combined dataset. We see nulls in the Open column, which we will have to treat


In [None]:
print('Test variable null counts:\n',test_df.isnull().sum()) 
print('\nTrain variable null counts:\n', train_df.isnull().sum()) 

##### Dropping columns that are not required as predictor variables, and filling the remaining nulls with 1 (treating those stores as open)


In [None]:
train_df_dropped = train_df.copy().drop(['Customers'],axis=1).fillna(1)
test_df_dropped = test_df.copy().drop(["Id"],axis=1).fillna(1)

###### Keep only open days with store sales > 0


In [None]:
train_df_dropped = train_df_dropped[(train_df_dropped.Open != 0)&(train_df_dropped.Sales >0)]

### Feature Engineering

##### The below function returns the week of the month for the specified date

In [None]:
def week_of_month(dt):
    """Return the week of the month (1-6) for a date, counting weeks from Monday."""
    first_day = dt.replace(day=1)
    dom = dt.day
    adjusted_dom = dom + first_day.weekday()
    return int(math.ceil(adjusted_dom/7.0))
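
Applying a Python function to every row of a million-row frame is slow. As a sketch only, the same calculation can be expressed with vectorized pandas operations; `week_of_month_vectorized` is a hypothetical helper name, and the scalar function is repeated here so the block runs on its own.

```python
import math
import pandas as pd

def week_of_month(dt):
    """Scalar reference: week of the month, with weeks starting on Monday."""
    first_day = dt.replace(day=1)
    return int(math.ceil((dt.day + first_day.weekday()) / 7.0))

def week_of_month_vectorized(dates):
    """Same calculation for a whole datetime Series, without a Python-level loop."""
    first_day = dates - pd.to_timedelta(dates.dt.day - 1, unit='D')
    # ceil(n / 7) for a positive integer n equals (n + 6) // 7
    return (dates.dt.day + first_day.dt.weekday + 6) // 7

dates = pd.Series(pd.date_range('2015-01-01', '2015-12-31', freq='D'))
fast = week_of_month_vectorized(dates)
slow = dates.apply(week_of_month)
print((fast == slow).all())
```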

##### Creating date variables for train and test (Month, Year, Day, Week, WeekOfMonth)


In [None]:
train_df_dropped['Month'] = train_df_dropped.Date.dt.month_name()
train_df_dropped['Year'] = train_df_dropped.Date.dt.year
train_df_dropped['Day'] = train_df_dropped.Date.dt.day_name()
train_df_dropped['Week'] = train_df_dropped.Date.dt.isocalendar().week.astype(int)  # Series.dt.week was removed in pandas 2.0
train_df_dropped['WeekOfMonth'] = train_df_dropped.Date.apply(week_of_month)

In [None]:
test_df_dropped['Month'] = test_df_dropped.Date.dt.month_name()
test_df_dropped['Year'] = test_df_dropped.Date.dt.year
test_df_dropped['Day'] = test_df_dropped.Date.dt.day_name()
test_df_dropped['Week'] = test_df_dropped.Date.dt.isocalendar().week.astype(int)  # Series.dt.week was removed in pandas 2.0
test_df_dropped['WeekOfMonth'] = test_df_dropped.Date.apply(week_of_month)

###### Calculating promo2 open time in months


In [None]:
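train_df_dropped['PromoOpen'] = 12 * (train_df_dropped['Year'] - train_df_dropped.Promo2SinceYear) + \
                                 (train_df_dropped.Week - train_df_dropped.Promo2SinceWeek) / 4.0
train_df_dropped['PromoOpen'] = train_df_dropped['PromoOpen'].clip(lower=0)
# Stores that never joined Promo2 have Promo2SinceYear filled with 0; set them to 0 months instead of a huge value
train_df_dropped.loc[train_df_dropped.Promo2SinceYear == 0, 'PromoOpen'] = 0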

In [None]:
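test_df_dropped['PromoOpen'] = 12 * (test_df_dropped['Year'] - test_df_dropped.Promo2SinceYear) + \
                                (test_df_dropped.Week - test_df_dropped.Promo2SinceWeek) / 4.0
test_df_dropped['PromoOpen'] = test_df_dropped['PromoOpen'].clip(lower=0)
# Stores that never joined Promo2 have Promo2SinceYear filled with 0; set them to 0 months instead of a huge value
test_df_dropped.loc[test_df_dropped.Promo2SinceYear == 0, 'PromoOpen'] = 0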

##### Calculating the historical monthly, weekly and daily average sales


In [None]:
# Mean training Sales per calendar key, computed once and joined onto both frames.
# (A row-wise apply would re-run the groupby for every single row.)
# how='left' leaves NaN where a frame has a key that never occurs in training.
for keys, name in [(['Month'], 'month_sales_avg'),
                   (['Week', 'Day'], 'day_sales_avg'),
                   (['Week'], 'week_sales_avg')]:
    avg = train_df_dropped.groupby(keys)['Sales'].mean().rename(name).reset_index()
    train_df_dropped = train_df_dropped.merge(avg, on=keys, how='left')
    test_df_dropped = test_df_dropped.merge(avg, on=keys, how='left')

##### Calculating the monthly, weekly and daily average sales by store type


In [None]:
# Mean training Sales per store type and calendar key, computed once and joined onto both frames.
# how='left' leaves NaN where a frame has a key that never occurs in training.
for keys, name in [(['StoreType', 'Month'], 'month_sales_avg_store'),
                   (['StoreType', 'Week', 'Day'], 'day_sales_avg_store'),
                   (['StoreType', 'Week'], 'week_sales_avg_store')]:
    avg = train_df_dropped.groupby(keys)['Sales'].mean().rename(name).reset_index()
    train_df_dropped = train_df_dropped.merge(avg, on=keys, how='left')
    test_df_dropped = test_df_dropped.merge(avg, on=keys, how='left')


##### Calculating the historical monthly, weekly and daily average sales by store

In [None]:
# Mean training Sales per store and calendar key, computed once and joined onto both frames.
# how='left' leaves NaN where a frame has a key that never occurs in training.
for keys, name in [(['Store', 'Month'], 'month_sales_avg_st'),
                   (['Store', 'Week', 'Day'], 'day_sales_avg_st'),
                   (['Store', 'Week'], 'week_sales_avg_st')]:
    avg = train_df_dropped.groupby(keys)['Sales'].mean().rename(name).reset_index()
    train_df_dropped = train_df_dropped.merge(avg, on=keys, how='left')
    test_df_dropped = test_df_dropped.merge(avg, on=keys, how='left')

##### For hold-out evaluation, we split off the last 6 weeks of the train dataset as the hold-out (validation) set


In [None]:
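train_sorted = train_df_dropped.sort_values(['Date'],ascending = False)
# Hold out the most recent six weeks by date. A fixed row count (6*7*1115) assumes every store
# has a row for every day, which no longer holds once closed days have been filtered out.
cutoff = train_sorted['Date'].max() - pd.Timedelta(weeks=6)
valid_df = train_sorted[train_sorted['Date'] > cutoff].copy()
# train_df keeps only the rows before the hold-out window, so the cells below
# that fit on train_df do not train on the validation rows.
train_df = train_sorted[train_sorted['Date'] <= cutoff].copy()
train_final = train_df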

##### Drop the target and other unused columns to create the X variables. We also drop the Store variable: with 1,115 stores, creating dummy variables for it would be computationally expensive.

In [None]:
valid = valid_df.drop(['Sales',"Date",'Store'],axis=1)
train = train_df.drop(['Sales',"Date",'Store'],axis=1)
test = test_df_dropped.drop(["Date",'Store'],axis=1)

In [None]:
#Check the datatypes of the final columns to identify the categorical columns
train_final.dtypes

##### Combining train,validation and test dataset to convert categorical variables to dummy variables

In [None]:
train_objs_num = len(train)
valid_objs_num = len(valid)
dataset = pd.concat(objs=[train, valid, test], axis=0)

##### Transform non numeric columns to dummies


In [None]:
cols_to_transform = [  'DayOfWeek' ,'StoreType','Assortment',  'PromoInterval','StateHoliday',  
                     'Month', 'Day', 'Week','WeekOfMonth']
dataset = pd.get_dummies(dataset ,columns = cols_to_transform )

##### Transforming the dataset back to train, validation and test


In [None]:
train_final = (dataset[:train_objs_num])
valid_final = (dataset[train_objs_num:valid_objs_num+train_objs_num])
test_final = (dataset[valid_objs_num+train_objs_num:])
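
The reason for stacking the splits before encoding is that `pd.get_dummies` only creates columns for the categories it sees. The toy frames below (made-up values, not the Rossmann data) sketch the difference between encoding a split on its own and encoding the stacked frame and slicing it back.

```python
import pandas as pd

train = pd.DataFrame({'StoreType': ['a', 'b', 'c'], 'Promo': [1, 0, 1]})
test = pd.DataFrame({'StoreType': ['a', 'a'], 'Promo': [0, 1]})

# Encoding each frame separately: test never sees 'b' or 'c', so its columns differ
separate = pd.get_dummies(test, columns=['StoreType'])

# Encoding the stacked frame and slicing it back gives both splits the same columns
n_train = len(train)
combined = pd.get_dummies(pd.concat([train, test], axis=0), columns=['StoreType'])
train_enc, test_enc = combined[:n_train], combined[n_train:]

print(list(separate.columns))
print(list(test_enc.columns))
```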

##### Combining train and validation dataset for Nested Cross Validation


In [None]:
train_final_total = (dataset[:valid_objs_num+train_objs_num])
train_final_total_sales = pd.concat(objs=[train_df['Sales'], valid_df['Sales']], axis=0)

##### Creating target variables for train and validation
##### Taking the log transformation of the target variable, as the sales vary a lot

In [None]:
y_train =  np.log1p(train_df['Sales'])
y_valid =  np.log1p(valid_df['Sales'])

### Predictive Models

##### Define the evaluation metric: competition submissions are evaluated on the Root Mean Square Percentage Error (RMSPE).


In [None]:
def rmspe(y, yhat):
    return np.sqrt(np.mean((yhat/y-1) ** 2))
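
As a small worked example with made-up numbers: two predictions that are off by +10% and -10% give an RMSPE of 0.1. The metric divides by the true value, so it is undefined for rows with zero sales, which were filtered out above. Where a model is fit on `log1p(Sales)`, its predictions need `expm1` before scoring.

```python
import numpy as np

def rmspe(y, yhat):
    # Root mean square of the relative errors (yhat - y) / y
    return np.sqrt(np.mean((yhat / y - 1) ** 2))

y_true = np.array([100.0, 200.0])
y_pred = np.array([110.0, 180.0])   # +10% and -10% relative error
score = rmspe(y_true, y_pred)
print(round(score, 4))

# A target stored as log1p(Sales) is mapped back to the sales scale with expm1
y_log = np.log1p(y_true)
restored = np.expm1(y_log)
print(np.allclose(restored, y_true))
```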

##### To determine which model to use, we compare the cross-validation scores of all the algorithms. We select the algorithm with the lowest error and then tune its hyperparameters to get the best predictions.

###### We will compare Decision Tree, KNN, Random Forest, Neural Network, SVR, GBM, AdaBoost and XGBoost regression

### Model selection

Before running all the models, we use nested cross-validation to select the best-performing one. The function below computes the nested cross-validation score so that we can choose a model and tune its hyperparameters.

In [None]:
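def model_sel(alg, tuned_parameters, n1, n2, X, y):
    """Nested cross-validation: return the mean and std of the outer-fold RMSE."""
    inner_cv = KFold(n_splits=n1, shuffle=True,  random_state= 5)
    outer_cv = KFold(n_splits=n2, shuffle=True,  random_state= 5)
    # Inner loop: grid search picks the hyperparameters on each outer training fold
    clf = GridSearchCV(alg, tuned_parameters, cv = inner_cv, scoring = 'neg_mean_squared_error')
    print(clf.estimator)
    # Outer loop: scikit-learn reports the negated MSE, so flip the sign before the square root
    nested_score = cross_val_score(clf, X, y, cv = outer_cv, scoring = 'neg_mean_squared_error')
    nested_score = np.sqrt(-nested_score)
    RMSE = nested_score.mean()
    STD = nested_score.std()
    return RMSE, STD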

We will use the results from the function above to select the model. In this example I tune hyperparameters for every model, but the better approach is to choose a model first and then tune hyperparameters only for it.
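
The nested procedure can be sketched end to end on synthetic data. The dataset from `make_regression`, the small `max_depth` grid and the fold counts below are assumptions chosen only to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=0)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=5)   # tunes hyperparameters
outer_cv = KFold(n_splits=4, shuffle=True, random_state=5)   # estimates generalisation error

search = GridSearchCV(DecisionTreeRegressor(random_state=0),
                      {'max_depth': [2, 4, 6]},
                      cv=inner_cv, scoring='neg_mean_squared_error')

# Each outer fold refits the whole grid search on its training part, so the
# outer test part never influences which hyperparameters are chosen
neg_mse = cross_val_score(search, X, y, cv=outer_cv, scoring='neg_mean_squared_error')
fold_rmse = np.sqrt(-neg_mse)
print(fold_rmse.mean(), fold_rmse.std())
```

Comparing the mean outer-fold RMSE across algorithms gives a selection criterion that is not biased by the tuning itself.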

### Model 1 - Decision Tree Regression

In [None]:
# Gridsearch and fitting the model
from sklearn.tree import DecisionTreeRegressor
# Hyper parameters of decision tree
params = {'max_depth': np.arange(3, 10),
                 'min_samples_split': np.linspace(0.01, 0.3, 30, endpoint=True), 
                 'min_samples_leaf': np.linspace(0.01, 0.15, 15, endpoint=True)}
#Defining the model
reg = DecisionTreeRegressor()
#GridsearchCV to tune hyperparameters and get the best predictions 
grid_tree = GridSearchCV(reg, params, cv = 10, refit = True)
#Training the model on train data
grid_tree.fit(train_final,train_df['Sales'])
#Predicting the sales on the validation dataset
y_pred = grid_tree.predict(valid_final)
#Evaluating the RMSPE value
rmspe(valid_df['Sales'],y_pred)

In [None]:
# nested crossvalidation score
tree_RMSE, tree_STD = model_sel(reg, params, 5, 5, train_final_total, train_final_total_sales)

### Model 2 - K-NN Regression

##### KNN is a distance-based algorithm, so the X variables need to be scaled first; we use a pipeline to do that.

In [None]:
from sklearn.feature_selection import f_regression

# Fit the scaler on the training data only, then apply the same scaling to validation and test
scaler = MinMaxScaler(feature_range=(0,1))
X_train = scaler.fit_transform(train_final)
X_valid = scaler.transform(valid_final)
X_test = scaler.transform(test_final)

# Build the regression pipeline. MinMaxScaler rescales each feature (Normalizer would rescale
# each row instead), and f_regression is the univariate score for a continuous target.
pipeline = Pipeline([('normalize', MinMaxScaler()),
                     ('kbest', SelectKBest(f_regression)),
                     ('regressor', KNeighborsRegressor())])
# try n_neighbors from 1 to 10, and feature count from 1 to the number of features
parameters = {'kbest__k':  list(range(1, X_train.shape[1]+1)),
              'regressor__n_neighbors': list(range(1,11))}
grid = GridSearchCV(pipeline, parameters, cv=10, scoring="neg_mean_squared_error")
grid.fit(X_train, train_df['Sales'])

y_pred = grid.predict(X_valid)
rmspe(valid_df['Sales'],y_pred)

# Predict the test set with the tuned KNN pipeline
y_pred = grid.predict(X_test)


In [None]:
# nested crossvalidation score
knn_RMSE, knn_STD = model_sel(pipeline, parameters, 5, 5, train_final_total, train_final_total_sales)

### Model 3 - Random Forest

In [None]:
import sklearn.ensemble as es
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
max_features = [1.0, 'sqrt']  # 'auto' (all features) was removed in scikit-learn 1.3; 1.0 is the equivalent
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]
grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
rf = es.RandomForestRegressor()
rf_grid = GridSearchCV(estimator = rf, param_grid = grid, cv = 10, verbose=10, n_jobs = -1)
rf_grid.fit(train_final,train_df['Sales'])
y_pred = rf_grid.predict(valid_final)
rmspe(valid_df['Sales'],y_pred)

y_pred = rf_grid.predict(test_final)

In [None]:
# nested crossvalidation score
rf_RMSE, rf_STD = model_sel(rf, grid, 5, 5, train_final_total, train_final_total_sales)

### Model 4 - Neural Network

In [None]:
def base_model():
    model = Sequential()
    model.add(Dense(200, input_dim=n_cols, kernel_initializer='normal',activation='relu'))
    model.add(Dense(100, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal'))
    model.compile(loss='mean_squared_error', optimizer='adamax')
    return model

model = KerasRegressor(build_fn=base_model, verbose=0)

## Tuning Hyper -Parameter
# Optimizing for Batch Size and Epochs
batch_size = [5, 20, 40, 100]
epochs = [10, 50, 100,300]

#get number of columns in training data
n_cols = train_final.shape[1]

param_grid = dict(batch_size=batch_size, epochs=epochs)
grid = GridSearchCV(estimator=model, param_grid=param_grid,
                    scoring='neg_mean_squared_error',n_jobs=1,cv=10)
grid_result = grid.fit(train_final,train_df['Sales'])

In [None]:
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))


In [None]:
# Optimizing for Size of first and second layers
def base_model(l1=200,l2=100):
    model = Sequential()
    model.add(Dense(l1, input_dim=n_cols, kernel_initializer='normal' ,activation='relu'))
    model.add(Dense(l2, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal'))
    model.compile(loss='mean_squared_error', optimizer='Adamax')
    return model

model = KerasRegressor(build_fn=base_model, verbose=0, epochs=10, batch_size=5)
l1 = [60, 200, 500, 600, 1000]
l2 = [100, 250 ,400, 500, 600, 700]
param_grid = dict(l1=l1, l2=l2)
grid = GridSearchCV(estimator=model, param_grid=param_grid,scoring='neg_mean_squared_error')
grid_result = grid.fit(train_final,train_df['Sales'])

means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))


In [None]:
# Final model on best hyper parameter setting
def final_model():
    model = Sequential()
    model.add(Dense(1000, input_dim=n_cols, kernel_initializer='normal' ,activation='relu'))
    model.add(Dense(600, kernel_initializer='normal',activation='relu',))
    model.add(Dense(1, kernel_initializer='normal'))
    model.compile(loss='mean_squared_error', optimizer='Adamax')
    return model

reg = KerasRegressor(build_fn=final_model, epochs=300, batch_size=40,verbose=1,validation_split=0.2)
kfold = KFold(n_splits=5, shuffle=True, random_state=1)  # random_state only takes effect with shuffle=True
results = np.sqrt(-1*cross_val_score(reg, train_final,train_df['Sales'],scoring= "neg_mean_squared_error", cv=kfold))
print("Training RMSE mean and std from CV: {} {}".format(results.mean(),results.std()))

# Fit on the training data, then evaluate on the hold-out set and predict the test set
reg.fit(train_final,train_df['Sales'])
y_pred = reg.predict(train_final)

y_pred = reg.predict(valid_final)
rmspe(valid_df['Sales'],y_pred)

y_pred = reg.predict(test_final)

### Feature selection 

In [None]:
#Feature selection using Random Forest
clf = RandomForestRegressor(n_estimators=100, random_state=1)
clf.fit(train_final,train_df['Sales'])
feature_importance = pd.DataFrame(np.round(clf.feature_importances_,3),
                        index = train_final.columns,
                        columns=['importance']).sort_values('importance',
                        ascending = False)

#Features are ranked by how much they reduce impurity and create pure nodes.
#Based on the ranking, we pick the top 10 features,
#which have importance greater than 0.009.

#We then train on these top 10 features and compare how
#performance differs.
top_features = feature_importance.nlargest(10,"importance")
top_features.T.plot(kind = "bar")

# Selecting top features for training; the scaler is fit on the training data only
scaler = MinMaxScaler()
x_train_scaled_fs = scaler.fit_transform(train_final.filter(items = top_features.index))
x_valid_scaled_fs = scaler.transform(valid_final.filter(items = top_features.index))
x_test_scaled_fs = scaler.transform(test_final.filter(items = top_features.index))

### SVR

In [None]:
# Model 5 - SVR
from sklearn.svm import SVR

#Grid Search SVM Regression
parameters = {'kernel':['linear','rbf','poly'], 'C':[0.1, 1, 10, 100, 1000],
              'gamma': [0.1, 1, 10],'degree': [0, 1, 2]}
clf = GridSearchCV(SVR(), parameters,cv= 10, verbose = 10,scoring='neg_mean_squared_error',refit=True,n_jobs=-1)
clf.fit(x_train_scaled_fs,train_df['Sales']) 
clf.best_params_
y_pred = clf.predict(x_valid_scaled_fs)
rmspe(valid_df['Sales'],y_pred)
# Fifth submission attempt
y_pred = clf.predict(x_test_scaled_fs)

In [None]:
# nested crossvalidation score
svr_RMSE, svr_STD = model_sel(SVR(), parameters, 5, 5, train_final_total, train_final_total_sales)

### GBM

In [None]:
# Model 6 - GBM
from sklearn.ensemble import GradientBoostingRegressor
# Setting the hyperparameters by cross-validation
gbm_parameters = [{'n_estimators': [100,500,1000],
                    'max_depth': [6,10,14],
                    'max_features': [4,8,12],
                    'min_samples_leaf': [5,10,15]}]

reg = GridSearchCV(GradientBoostingRegressor(), gbm_parameters, cv=10, scoring = 'neg_mean_squared_error')
reg.fit(train_final,train_df['Sales'])

print("Best parameters set found on development set:")
print(reg.best_params_)

print("Best score found on development set:")
print(reg.best_score_)

y_pred = reg.predict(valid_final)
rmspe(valid_df['Sales'],y_pred)

In [None]:
# nested crossvalidation score
gbm_RMSE, gbm_STD = model_sel(GradientBoostingRegressor(), gbm_parameters, 5, 5, train_final_total, train_final_total_sales)

### AdaBoost

In [None]:
# Model 7 - Adaboost
from sklearn.ensemble import AdaBoostRegressor

# Setting the hyperparameters by cross-validation
ab_parameters = [{'n_estimators': [10,50,100,200],
                    'learning_rate': [0.01,0.1,0.3]}]

reg = GridSearchCV(AdaBoostRegressor(), ab_parameters, cv=10, scoring='neg_mean_squared_error')
reg.fit(train_final, train_df['Sales'])

print("Best parameters set found on development set:")
print(reg.best_params_)

print("Best score found on development set:")
print(reg.best_score_)


# Evaluate the tuned AdaBoost model on the hold-out set (the test set has no Sales labels to score against)
y_pred = reg.predict(valid_final)
rmspe(valid_df['Sales'],y_pred)

y_pred = reg.predict(test_final)

In [None]:
# nested crossvalidation score
adb_RMSE, adb_STD = model_sel(AdaBoostRegressor(), ab_parameters, 5, 5, train_final_total, train_final_total_sales)

### XgBoost

In [None]:

# Model 8 - XGBoost
from xgboost import XGBRegressor

# A parameter grid for XGBoost
xgb_model = XGBRegressor()
parameters = {'objective':['reg:squarederror'],  # current name for the deprecated 'reg:linear'
              'learning_rate': [0.02,0.03,0.04,0.05], #so called `eta` value
              'max_depth': [6,7,8,9,10,11,12,13],
              'subsample': [0.7,0.8,0.9],
              'colsample_bytree': [0.7,0.8,0.9]}

clf = GridSearchCV(xgb_model, parameters, n_jobs=5, cv=10, verbose=2, refit=True)
clf.fit(train_final,train_df['Sales']) 
clf.best_params_
y_pred = clf.predict(valid_final)
rmspe(valid_df['Sales'],y_pred)

In [None]:
# nested crossvalidation score
xg_RMSE, xg_STD = model_sel(xgb_model, parameters, 5, 5, train_final_total, train_final_total_sales)

XGBoost was found to be the best-performing model of all those compared.

In [None]:
# xgb.train takes a DMatrix, so we pass a custom evaluation function that computes RMSPE
# on the original sales scale (the labels are log1p-transformed)
def rmspe_xg(yhat, y):
    y = np.expm1(y.get_label())
    yhat = np.expm1(yhat)
    return "rmspe", rmspe(y,yhat)

# XGboost converting to DMatrix and using num_boost_round on the final selected parameters

import xgboost as xgb

params = {"objective": "reg:squarederror",  # current name for the deprecated "reg:linear"
          "eta": 0.03,  
          "max_depth": 10,
          "subsample": 0.9,
          "colsample_bytree": 0.7
          }
num_boost_round = 4000

dtrain = xgb.DMatrix(train_final, y_train)
dvalid = xgb.DMatrix(valid_final, y_valid)
watchlist = [(dtrain, 'train'), (dvalid, 'eval')]
xg_model = xgb.train(params, dtrain, num_boost_round, evals=watchlist, early_stopping_rounds= 100, feval=rmspe_xg, verbose_eval=True)

x_train_total = pd.concat(objs=[train_final, valid_final], axis=0)
y_train_total = pd.concat(objs=[y_train, y_valid], axis=0)

#Training the model on the entire data
dtrain = xgb.DMatrix(x_train_total, y_train_total)
dtest = xgb.DMatrix(test_final)
params = {"objective": "reg:squarederror",  # current name for the deprecated "reg:linear"
          "booster" : "gbtree",   
          "eta": 0.03,   
          "max_depth": 10,  
          "subsample": 0.9, 
          "colsample_bytree": 0.7,        
          }
num_round = 1000
xg_model = xgb.train(params, dtrain, num_round)
# make predictions on the test data
preds = xg_model.predict(dtest)


Note: xgboost.train() returns the model from the last iteration, not the best one.