# Rossman Sales Model Execution

This notebook allows you to run predictive analysis on Rossman Store data in order to predict sales for any given store on any given day.

You should already have a cleaned dataset from completing the `Cleaning and Feature Engineering` notebook. If you do not have this, please go back and complete that notebook before using this one.

This notebook creates predictions using an ensemble of the following three models:

* Random Forest
* Multivariate Regression
* Gradient Boosted Trees

So, without further ado, let's get to it!

In [84]:
import numpy as np
import pandas as pd

First - let's load up the dataset you created using the `Cleaning and Feature Engineering` notebook. We'll refer to this dataset as `cleaned_rossman_data`.

In [86]:
cleaned_rossman_data = pd.read_csv('cleaned_rossman_test_data.csv')

Next, we need to split this dataset into our X and Y sets. First, a quick look at the data:

In [87]:
cleaned_rossman_data.head()

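| Unnamed: 0.1 | Unnamed: 0 | Store | DayOfWeek | day | month | year | Sales | Customers | Open | Promo | ... | storeType_a | storeType_b | storeType_c | storeType_d | Assortment_a | Assortment_b | Assortment_c | public_holiday | easter | christmas |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 173 | 5 | 17 | 5 | 2013 | 9296 | 0.151387 | 1 | 1 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 174 | 5 | 17 | 5 | 2013 | 6701 | 0.067057 | 1 | 1 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | 175 | 5 | 17 | 5 | 2013 | 6349 | 0.09562 | 1 | 1 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 3 | 176 | 5 | 17 | 5 | 2013 | 6171 | 0.09086 | 1 | 1 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 4 | 177 | 5 | 17 | 5 | 2013 | 4391 | 0.069777 | 1 | 1 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |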
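
The `Unnamed: 0` and `Unnamed: 0.1` columns in this preview look like row indexes that were written to CSV and then read back in as data. Here is a minimal sketch of how that happens and how to avoid it, using a toy frame. The name `toy` and the in-memory buffers are illustrative assumptions, not part of this project.

```python
import io

import pandas as pd

# Toy data standing in for the real dataset (hypothetical values).
toy = pd.DataFrame({"Store": [173, 174], "Sales": [9296, 6701]})

# By default, to_csv writes the row index as an extra column with no name...
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded = pd.read_csv(with_index)

# ...which read_csv then labels "Unnamed: 0".
print(list(reloaded.columns))

# Writing with index=False keeps the file free of index columns.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
clean = pd.read_csv(without_index)
print(list(clean.columns))
```

Changing how the cleaned file is written would also change the feature columns, so any saved models would likely need to be retrained against the new layout.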


In [88]:
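# Drop one leftover index column. Note that "Unnamed: 0.1" is left in place,
# so it is passed to the models as a feature.
cleaned_rossman_data.drop("Unnamed: 0", axis=1, inplace=True)
# Target: keep Sales as a one-column DataFrame so it converts to shape (n, 1).
Y = cleaned_rossman_data.Sales.to_frame()
# Features: X is the same object as cleaned_rossman_data, so the in-place
# drop below also removes Sales from cleaned_rossman_data.
X = cleaned_rossman_data
X.drop('Sales', axis=1, inplace=True)
final_test_x = X.to_numpy()
final_test_y = Y.to_numpy()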
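
Because `X` is just another name for `cleaned_rossman_data`, the in-place drop also removes `Sales` from the original frame. If you would rather keep the source frame intact, here is a sketch of a non-mutating split on a toy frame (`toy`, `toy_x`, and `toy_y` are illustrative names):

```python
import pandas as pd

# Toy data standing in for the cleaned dataset (hypothetical values).
toy = pd.DataFrame({
    "Store": [173, 174, 175],
    "Promo": [1, 1, 0],
    "Sales": [9296, 6701, 6349],
})

# drop(columns=...) returns a new frame, so `toy` keeps its Sales column.
toy_x = toy.drop(columns="Sales").to_numpy()
# Double brackets select a one-column DataFrame, giving an (n, 1) array.
toy_y = toy[["Sales"]].to_numpy()

print(toy_x.shape, toy_y.shape)
```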

We won't know if we did a good job unless we measure our results, so let's bring in our measuring stick: root mean square percentage error (RMSPE), reported as a percentage, where lower is better.

In [89]:
def metric(preds, actuals):
    preds = preds.reshape(-1)
    actuals = actuals.reshape(-1)
    assert preds.shape == actuals.shape
    return 100 * np.linalg.norm((actuals - preds) / actuals) / np.sqrt(preds.shape[0])
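
In words, the metric is 100 times the square root of the mean squared relative error. Since it divides by `actuals`, any row with zero sales (a closed store, for example) would produce an infinite or undefined score. The sketch below shows a tiny worked example plus a hypothetical `metric_nonzero` variant, which assumes such rows should simply be skipped. That is one possible choice, not something this notebook does.

```python
import numpy as np

def metric(preds, actuals):
    # Same computation as the notebook's metric: RMSPE as a percentage.
    preds = preds.reshape(-1)
    actuals = actuals.reshape(-1)
    assert preds.shape == actuals.shape
    return 100 * np.linalg.norm((actuals - preds) / actuals) / np.sqrt(preds.shape[0])

def metric_nonzero(preds, actuals):
    # Hypothetical variant: drop rows where actual sales are zero before scoring.
    preds = preds.reshape(-1)
    actuals = actuals.reshape(-1)
    mask = actuals != 0
    return metric(preds[mask], actuals[mask])

# Both predictions miss by 10% of the actual value, so the RMSPE is 10 percent.
example = metric(np.array([110.0, 90.0]), np.array([100.0, 100.0]))
print(example)

# The zero-sales row is skipped instead of triggering a division by zero.
example_masked = metric_nonzero(
    np.array([110.0, 90.0, 5.0]),
    np.array([100.0, 100.0, 0.0]),
)
print(example_masked)
```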

Now, let's bring in the artillery. Time to load our models.

Please note that the Random Forest model was too large to upload to GitHub. As a result, it is currently excluded from this reproducibility walkthrough, and the Random Forest lines below are commented out.

In [90]:
import pickle

# Random Forest is excluded because its pickle is not in the repository.
#rossman_random_forest = pickle.load(open(filename, 'rb'))
with open('Models/multivar.pkl', 'rb') as f:
    rossman_multivariate_regression = pickle.load(f)
with open('Models/gradient_boosted_model.pkl', 'rb') as f:
    rossman_gradient_boosted = pickle.load(f)
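
Since the Random Forest pickle may not be on disk, one option is to load each model only if its file exists. Here is a minimal sketch, assuming a hypothetical helper named `load_model_if_present`. The demo pickles a plain dictionary into a temporary directory rather than touching the real `Models/` files. Only unpickle files from sources you trust, since `pickle.load` can execute arbitrary code.

```python
import os
import pickle
import tempfile

def load_model_if_present(path):
    """Return the unpickled object at `path`, or None if the file is missing."""
    if not os.path.exists(path):
        return None
    # The context manager closes the file handle even if unpickling fails.
    with open(path, "rb") as model_file:
        return pickle.load(model_file)

# Demo with a stand-in object instead of a real model.
with tempfile.TemporaryDirectory() as tmp_dir:
    present_path = os.path.join(tmp_dir, "stand_in_model.pkl")
    with open(present_path, "wb") as out_file:
        pickle.dump({"name": "stand-in"}, out_file)

    loaded = load_model_if_present(present_path)
    missing = load_model_if_present(os.path.join(tmp_dir, "not_there.pkl"))

print(loaded, missing)
```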



Let's make those predictions!

In [91]:
#y_pred_random_forest = rossman_random_forest.predict(final_test_x)
y_pred_multivariate_regression = rossman_multivariate_regression.predict(final_test_x)
y_pred_gradient_boosted = rossman_gradient_boosted.predict(final_test_x)

How did each model perform?

In [93]:
#metric(y_pred_random_forest, final_test_y)
print(metric(y_pred_multivariate_regression, final_test_y))
print(metric(y_pred_gradient_boosted, final_test_y))

30.950647493179606
28.47331573070836


Ok - so we have two Y hat prediction vectors (three once the Random Forest model is included). We are going to combine these with a weighted average:

In [115]:
# The active weights sum to 1; rebalance them before adding the Random Forest term back in.
y_pred_final = (0.3 * y_pred_multivariate_regression) + (0.7 * y_pred_gradient_boosted)  # + (0.3 * y_pred_random_forest)
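
The same blend generalizes to any number of models. Here is a sketch, assuming a hypothetical helper named `weighted_ensemble` that skips models whose predictions are unavailable (`None`) and rescales the remaining weights to sum to 1. The arrays are made-up values, not results from this notebook.

```python
import numpy as np

def weighted_ensemble(predictions, weights):
    """Weighted average of prediction vectors, ignoring entries set to None."""
    # Flatten each vector so (n, 1) and (n,) outputs combine element-wise
    # instead of broadcasting into an (n, n) matrix.
    pairs = [
        (np.asarray(p, dtype=float).reshape(-1), w)
        for p, w in zip(predictions, weights)
        if p is not None
    ]
    if not pairs:
        raise ValueError("no predictions to combine")
    total_weight = sum(w for _, w in pairs)
    # Dividing by the total keeps the result a true average when a model is skipped.
    return sum(w * p for p, w in pairs) / total_weight

regression_preds = np.array([100.0, 200.0])
boosted_preds = np.array([200.0, 400.0])
forest_preds = None  # stands in for the excluded Random Forest

blended = weighted_ensemble(
    [regression_preds, boosted_preds, forest_preds],
    [0.3, 0.7, 0.3],
)
print(blended)
```

With the Random Forest entry set to `None`, only the 0.3 and 0.7 weights are used, which matches the two-model blend above.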

Annnnnd, how did we do:

In [116]:
metric(y_pred_final, final_test_y)

27.8487071966977